ABSTRACT


The process of creating an intelligence status report requires continuous collection of
intelligence from sources of varying accuracy and reliability. Consequently, managing
many intelligence sources is costly not only to establish but also to maintain. Our
objective is to use the Multi-armed Bandit (MAB) framework to model intelligence
collection. The proposed framework generalizes the classical MAB model by accounting for
censoring in sampled observations in a resource-constrained environment. We devise an
online optimization framework, accompanied by rigorous analysis and comprehensive
numerical experiments, that sheds light on this real-world problem.

Executive Summary





This work focuses on the problem of intelligence collection in an uncertain environment
under budgetary constraints. The problem is difficult not only because of the uncertain
nature of the information gained, but also because the value of the information is often
tied to the level of effort required to collect it.


Our research provides an optimization framework, based on Multi-armed Bandits and
online optimization, to address the problem. The framework is novel in that it combines
several modeling elements that have previously been explored only separately.


We model information gained from intelligence sources by censoring. That is, efforts to
extract information from an intelligence source must exceed an unknown threshold to be
successful. Censoring of information ensures that the decision maker must spend sufficient
resources to collect valuable information. The goal of the framework is to maximize the
value of information gained and observed (not censored) in a limited time period under
budgetary constraints.


We provide a rigorous analysis of the proposed optimization problem and propose an
algorithm, called (K + 1)-UCB, to solve it. The algorithm aims to explore each intelligence
source as much as possible. By fusing the information gained from all intelligence sources,
the algorithm uncovers an exploitation point, that is, an allocation of the budget that
maximizes the value of information observed.


Our analysis shows that the performance of the algorithm is competitive with other
algorithms designed to solve similar problems. Numerical experiments support our
theoretical analysis and show that, in practice, our algorithm often performs better
than existing algorithms.


Finally, we highlight potential future work within the proposed framework that may be able
to further improve our results.

CHAPTER 1:
Introduction





1.1 Motivation 

Modern intelligence collection faces many challenges, and collectors of intelligence must be
able to practice many disciplines to create a reliable intelligence picture [1]. The importance
of maintaining a plethora of informative intelligence sources cannot be overstated, as is also
evident from the billion-dollar U.S. Intelligence Community Budget [2].


The large budget and the multitude of disciplines aim to mitigate the highly uncertain nature
of intelligence collection. Typically, one can model the uncertainty of an intelligence source
via a random variable, as a method to account for both variability of information and enemy
mitigation [3], [4]. We wish to expand on this approach of modeling intelligence as a random
variable by also tying the information gained from an intelligence source to the resources
the intelligence community is willing to commit to processing that source.


An online learning framework is attractive in the context of this problem because it can
model the time-critical decision-making process. Our goal is to use this framework to model
how the uncertainty tied to both Blue and Red actions affects the information gain, and to
design methods to optimize that information gain.


1.2 Model


A decision maker is given the task of allocating resources (such as manpower, budget or
otherwise) to the intelligence community, and specifically, to different intelligence sources
or operations. Each source or operation yields useful information, referred to as a “reward,”
which is valuable to the decision maker.


Importantly, the reward each source can yield is tied to both Red and Blue actions in
the following ways. Firstly, we assume for simplicity that each non-negative reward is
independently sampled from some distribution. These distributions are unknown to the
decision maker. Secondly, if the resources allocated to a certain source are insufficient,
a reward may be censored from the decision maker. A censored reward implies that no
information (that is, information of zero value) was gained from a source. The problem
faced by the decision maker is to allocate resources in order to maximize the expected value
of information collected.


1.3 Novelty 


We focus our attention on the Multi-armed Bandit (MAB) [5] framework in order to solve
the problem. The MAB framework can model each intelligence source as a different arm
with an unknown underlying distribution of rewards (information). In addition, we
incorporate both a budget allocated to play each of the arms and a censoring
mechanism that can eliminate rewards (setting them to zero) in a manner described in [6].
We note that while many of these elements have been explored in the context of MAB in
the past, our literature review (Chapter 2) shows that no single work has attempted to
incorporate them all together, let alone in the context of intelligence collection.


1.4 Research Questions of Interest 


Our research is mainly concerned with the following questions: 


• What is the best approach to collect information from an intelligence source in order
to evaluate it, given a budget and the censored nature of the information?

• What assumptions are required in order to address the problem algorithmically?

• Given the answers to the previous questions, can we devise an algorithm to address
the problem, despite the censored nature of the information? What is the performance of
such an algorithm?





CHAPTER 2:
Background 





In this chapter, we shall introduce the online optimization framework and the MAB problem.
We will note different algorithms designed to address the MAB problem and its variants.
We will also discuss how different MAB variants can prove useful for the modeling of our
problem.


2.1 Online Optimization Framework 

We begin by describing a general online optimization framework. An online optimization
framework includes a learner (e.g., a decision maker) that interacts with a time-varying
environment in an attempt to maximize reward. Formally, the learner is given a time
horizon $T \in \mathbb{N}$ and, at each time step $t = 1, 2, \ldots, T$, must decide on a
$K$-dimensional action $x_t \in S \subset \mathbb{R}^K$, where $S$ denotes the set of
allowable policies. Once an action has been set, the learner receives feedback in the form
of a time-dependent reward function $f_t(x_t)$.


A policy $\pi$ dictates the actions taken by the learner over the time horizon,
$\pi = \{x_t\}_{t=1}^{T}$, and typically depends on the information collected so far. The
goal of the learner is to minimize the regret from deviating from the optimal policy.
Formally, the regret of a policy $\pi$ is defined via

$$ r_T(\pi) = \mathbb{E}\left( \max_{x \in S} \sum_{t=1}^{T} f_t(x) - \sum_{t=1}^{T} f_t(x_t) \right). \tag{2.1} $$


If the functions $f_t$ for $t = 1, \ldots, T$ are convex, and the set $S$ is also convex,
this framework is also known as Online Convex Optimization (OCO) [7]. The convexity
property of the problem allows us to address it using Online Gradient Ascent (OGA) [7].
The OGA algorithm continuously improves the chosen action $x_t$ at each time step by
observing the gradient of the function $f_t$ at the point $x_t$ and taking a step in the
ascent direction given by the gradient, projected back onto $S$.
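The OGA update described above can be illustrated with a minimal sketch. The choices here are assumptions made purely for illustration and are not taken from [7]: the feasible set is the box $[0,1]^K$ (so that projection is coordinate-wise clipping), the rewards are linear, and the names `clip_to_box` and `online_gradient_ascent` are hypothetical.

```python
def clip_to_box(v, low=0.0, high=1.0):
    """Euclidean projection onto the box [low, high]^K (coordinate-wise clipping)."""
    return [min(max(x, low), high) for x in v]


def online_gradient_ascent(gradient_fns, x0, step_size):
    """Play x_t, observe the gradient of f_t at x_t, then take a projected ascent step."""
    x = list(x0)
    actions = []
    for grad in gradient_fns:
        actions.append(list(x))   # action played in this round
        g = grad(x)               # gradient feedback at the played point
        x = clip_to_box([xi + step_size * gi for xi, gi in zip(x, g)])
    return actions


# Hypothetical linear rewards f_t(x) = <w, x>, whose gradient is w everywhere.
w = [0.5, -0.25]
actions = online_gradient_ascent([lambda x: w] * 40, [0.5, 0.5], step_size=0.1)
print(actions[-1])  # the iterate has moved to the corner of the box favored by w
```

With a fixed linear reward the iterates simply walk toward the best corner of the box; with time-varying $f_t$ the same loop applies, only the gradient callbacks change.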


2.2 From Online Optimization to MAB 

The classical MAB problem is inspired by the notion of a gambler aiming to choose the best 
sequence of one-armed bandit machines to play each time [5]. The gambler’s decisions are 
based solely on past observations, and the gambler can only play one machine each time. 


This problem can be cast as a special case of the online optimization framework. 


In the MAB problem, we have a set of allowable actions $S_{MAB} = \{e_i \in \mathbb{R}^K\}$,
where $e_i$ is a standard basis vector, whose $i$-th coordinate is 1 and all other
coordinates are zero. Each action plays one of $K$ arms, whose underlying reward
distribution has a mean $\mu_i$, $i \in [K]$. We let $\mu \in \mathbb{R}^K$ be the vector
whose coordinates are $\mu_i$, so that $\mathbb{E}(f_t(x_t)) = \mu^T x_t$. Since
$x_t = e_i$ for some $i$, we have $\mathbb{E}(f_t(x_t)) = \mu_i$. The motivation behind
exploring the MAB within the online optimization framework is that it allows us to easily
expand the problem to support mixed policies, that is, $x \notin S_{MAB}$.


We note that this formulation of MAB is not an OCO. While the functions $f_t$ can be
expressed as the linear functions $f_t(x_t) = \mu^T x_t$, the set
$S_{MAB} = \{e_i \in \mathbb{R}^K\}$ is not convex. If we wish to address the MAB problem
using the OCO framework, one potential relaxation [7] is to replace $S$ with the
$K$-dimensional simplex, denoted $\Delta_K$.


2.2.1 Solving the MAB Problem 


In this section, we discuss two important algorithms used to solve the MAB problem: Upper 
Confidence Bound (UCB) and the Flaxman, Kalai, and McMahan (FKM) algorithm [7]. 


UCB 

MAB algorithms aim to balance exploration of different arms against exploitation
of arms whose mean reward is the best sampled so far. The UCB algorithm [5] balances
exploration and exploitation by adopting the “Optimism Under Uncertainty” principle, that
is, by assuming that arm means are as high as possible, based on what has been observed.
The UCB algorithm makes use of an index for each arm that serves as a ranking mechanism,
guiding the algorithm on which arm to play at each time step $t$. It has been shown [5] that
the regret is logarithmic in the time horizon, $r_T(\pi_{UCB}) = O(\log T)$, where
$\pi_{UCB}$ is the policy dictated by the UCB algorithm. A pseudo-code of the UCB method is
shown in Algorithm 1.


Algorithm 1 defines an index for each arm to balance exploration and exploitation.
With exploration, the algorithm spends a time step to sample an arm to increase its
confidence in its evaluation of the mean. With exploitation, the algorithm samples the
arm deemed most valuable. UCB implements this balance by accounting for a
confidence interval for each sampled mean of each arm. The algorithm takes an optimistic
approach in the sense that we only use the upper bound of the confidence interval; hence
the name Upper Confidence Bound.


Importantly, we note that the form of the upper bound of the confidence interval depends
on both the current time step $t$ and the number of samples of each arm $N_{k,t}$. As $t$
increases, the confidence interval becomes much looser for under-explored arms, allowing
the algorithm to perform exploration of those arms. If $N_{k,t}$ increases, the confidence
interval becomes tighter, offsetting the natural expansion of the confidence interval.
Observe that the dependency of the confidence interval on those factors that allows this
behavior is of the form $o(t)/N_{k,t}$. Further details can be found in [5].





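Algorithm 1: UCB for Classical MAB
Input: $K \in \mathbb{N}$ (number of arms),
       $T \in \mathbb{N}$ (time horizon)
Output: $\pi_{UCB} = \{x_t\}_{t=1}^{T}$ (UCB policy)
$\forall k = 1, \ldots, K$: $UCB_{k,0} = \infty$, $N_{k,0} = 0$, $\hat{\mu}_{k,0} \leftarrow 0$
for $t = 1, 2, \ldots, T$ do
    $k_t \leftarrow \arg\max_{k = 1, \ldots, K} UCB_{k,t-1}$ (choose best arm)
    $x_t \leftarrow e_{k_t}$
    $y_t \leftarrow f_t(x_t)$ (play arm $k_t$ and collect information of value $y_t$)
    $N_{k_t,t} \leftarrow N_{k_t,t-1} + 1$
    $\hat{\mu}_{k_t,t} \leftarrow (\hat{\mu}_{k_t,t-1} N_{k_t,t-1} + y_t) / N_{k_t,t}$
    $UCB_{k,t} \leftarrow \hat{\mu}_{k,t} + \sqrt{2 \log t / N_{k,t}}$ (update UCB index)
end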
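A minimal runnable sketch of Algorithm 1 is given below. It assumes Bernoulli rewards and a caller-supplied random generator purely for illustration; the function name `ucb` and its arguments are hypothetical and are not part of [5].

```python
import math
import random


def ucb(arm_means, horizon, rng):
    """Run UCB on hypothetical Bernoulli arms; returns (pull counts, empirical means)."""
    k = len(arm_means)
    counts = [0] * k      # N_k: number of times each arm was played
    means = [0.0] * k     # running empirical mean of each arm
    for t in range(1, horizon + 1):
        # Index from the previous round; unplayed arms have an infinite index,
        # so every arm is tried once before any arm is repeated.
        indices = [
            math.inf if counts[a] == 0
            else means[a] + math.sqrt(2.0 * math.log(t - 1) / counts[a])
            for a in range(k)
        ]
        arm = max(range(k), key=lambda a: indices[a])
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running-mean update
    return counts, means


counts, means = ucb([0.2, 0.8], horizon=2000, rng=random.Random(0))
print(counts)  # most pulls are expected to go to the arm with the higher mean
```

The incremental form of the mean update is algebraically the same as recomputing the average from the stored sum, but avoids keeping the sum separately.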











FKM Algorithm 
While the OGA algorithm is a compact method to solve problems formulated as OCO, we
cannot directly use it for the MAB problem. First, as mentioned above, we must relax the
requirement on the set of allowable policies from $S = \{e_i \in \mathbb{R}^K\}$ to
$\Delta_K$. Second, the learner in the MAB problem does not have access to the gradients
of the functions $f_t$.

In order to address these issues, a variant known as FKM [7] can be used to estimate
the gradients of $f_t$ without the additional assumption that the learner has access to
such gradients. For the FKM algorithm, it was shown that the regret is
$r_T(\pi_{FKM}) = O(T^{3/4})$.





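Algorithm 2: FKM algorithm for Convex Online Optimization
Input: $K \in \mathbb{N}$ (number of arms),
       $T \in \mathbb{N}$ (time horizon),
       $\delta > 0$,
       $\rho > 0$
Output: $\pi_{FKM} = \{x_t\}_{t=1}^{T}$ (FKM policy)
$x_1 \leftarrow 0$
for $t = 1, 2, \ldots, T$ do
    Randomly generate $u_t \in S_1 = \{u \in \mathbb{R}^K \mid \|u\|_2 = 1\}$
    $z_t \leftarrow x_t + \delta u_t$
    $y_t \leftarrow f_t(z_t)$ (play action $z_t$)
    $\hat{g}_t \leftarrow \frac{K}{\delta} f_t(z_t) u_t$
    $x_{t+1} = P_{\Delta_K(\delta)}(x_t + \rho \hat{g}_t)$ (take a projected-gradient step in ascent direction)
end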











We note that the algorithm requires the use of $P_{\Delta_K(\delta)}$, an orthogonal
projection operator onto the set
$\Delta_K(\delta) = \{x \in \mathbb{R}^K \mid \frac{1}{1-\delta} x \in \Delta_K\}$. Several
algorithms exist for this purpose, such as [8] and [9]. Importantly, since FKM projects
onto $\Delta_K(\delta)$ and not $\Delta_K$, the choice of $\delta$ affects the actual set
of policies proposed by FKM.
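As an illustration of what such a projection operator involves, the sketch below implements a common sort-based Euclidean projection onto the full simplex $\Delta_K$. It is an assumption that this is representative; it is not claimed to be the method of [8] or [9], and it does not handle the shrunken set used by FKM. The function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_k >= 0, sum_k x_k = 1} (sort-based)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        # The condition holds for a prefix of the sorted values; the last
        # candidate that satisfies it is the shift to subtract from every entry.
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


print(project_to_simplex([2.0, 0.0]))        # mass collapses onto the first coordinate
print(project_to_simplex([0.2, 0.2, 0.2]))   # equal entries are lifted to 1/3 each
```

Points already on the simplex are returned unchanged, since the computed shift is zero in that case.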


2.3. Variants on the MAB Problem 


We now focus our attention on certain variants on the original MAB problem. These variants 
introduce additional components to the classical MAB problem that make the MAB problem 


more challenging but also more relevant to our modeling efforts. 


2.3.1 Continuum MAB 

Kleinberg [10] explored a variant known as “Continuum MAB”, where $x_t$ is not restricted
to the classical definition of $S$ discussed above. Specifically, the work in [10]
introduces an optimal algorithm for the specific case where $K = 2$.





Algorithm 3: Kleinberg’s Continuum MAB for K = 2 Arms
Input: $T \in \mathbb{N}$ (time horizon)
Output: $\pi_{Kleinberg} = \{x_t\}_{t=1}^{T}$ (Kleinberg’s policy)
$t \leftarrow 1$
while $t \le T$ do
    $K_t \leftarrow (t / \log t)^{1/3}$
    Initialize MAB algorithm over $K_t$ arms
    for $t' = t, t+1, \ldots, \min(2t - 1, T)$ do
        $i_{t'} \leftarrow$ Best arm from MAB algorithm
        $x_{t'} \leftarrow (i_{t'}/K_t, 1 - i_{t'}/K_t)^T$ (transform the discrete arm $i_{t'}$ into a 2-dimensional point)
        $y_{t'} \leftarrow f_{t'}(x_{t'})$ (play 2-dimensional point corresponding to $i_{t'}$)
        Update MAB algorithm with reward $y_{t'}$ from playing $i_{t'}$
    end
    $t \leftarrow 2t$
end











Algorithm 3 details Kleinberg’s approach for the $K = 2$ case. The algorithm divides the
time horizon $1, \ldots, T$ into phases, each with twice as many time steps as the previous
phase (but such that the total number of time steps is still $T$). In each phase, an MAB
with $K_t$ arms is used, with each arm representing a point in 2-dimensional space. For
$i \in [K_t]$, the $i$-th arm is the vector $(i/K_t, 1 - i/K_t)^T \in \Delta_2$. If the
internal MAB used in each phase chooses a specific arm index $i$ to be played, Kleinberg’s
method then samples the corresponding vector.


The method proposed by Kleinberg runs an MAB algorithm with each discrete arm index
mapped to a point in continuous space. By iteratively increasing the number of arms, the
internal MAB algorithm can sample more points in the $\mathbb{R}^2$ space and approach
the optimal point more closely. Consequently, Kleinberg’s work ties together the discrete
nature of MAB and the continuous nature of the problem. Kleinberg’s algorithm achieves a
regret of $r_T(\pi_{Kleinberg}) = O(T^{2/3})$ [10]. The importance of Kleinberg’s algorithm
lies in the fact that it can work in a continuous space without needing to tune a step size
or approximate the space with a parameter, as FKM requires.
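The phase structure and the arm-to-point mapping described above can be sketched as follows. Two details are assumptions added for illustration: the number of arms is rounded up to an integer, and at least two arms are used (the expression $(t/\log t)^{1/3}$ is undefined at $t = 1$). The helper names `phase_schedule` and `arm_to_point` are hypothetical.

```python
import math


def phase_schedule(horizon):
    """Return a list of (first_step, last_step, num_arms) for each doubling phase."""
    phases = []
    t = 1
    while t <= horizon:
        last = min(2 * t - 1, horizon)
        if t < 2:
            num_arms = 2  # assumption: (t / log t)^(1/3) is undefined at t = 1
        else:
            # assumption: round up so that the number of arms is an integer >= 2
            num_arms = max(2, math.ceil((t / math.log(t)) ** (1.0 / 3.0)))
        phases.append((t, last, num_arms))
        t *= 2
    return phases


def arm_to_point(i, num_arms):
    """Map discrete arm index i in {1, ..., num_arms} to a point on the 2-simplex."""
    return (i / num_arms, 1.0 - i / num_arms)


for first, last, num_arms in phase_schedule(10):
    print(first, last, [arm_to_point(i, num_arms) for i in range(1, num_arms + 1)])
```

Within each phase, any discrete MAB routine (for instance UCB) would be run over the indices `1, ..., num_arms`, playing `arm_to_point(i, num_arms)` whenever it selects index `i`.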


We note that Kleinberg et al. developed a generalization of their technique to arbitrary
metric spaces in [11]. In their work, an algorithm called the zooming algorithm is capable
of zooming into a ball in the metric space in which the optimal policy lies. The regret of
the zooming algorithm is of order $O(\log T \cdot T^{\frac{K+1}{K+2}})$ for the
$K$-dimensional Euclidean space, provided the rewards are Lipschitz [11].


2.3.2 Censored MAB 
Abernethy et al. [6] addressed a variant of MAB where feedback from each arm may be 
censored according to an unknown threshold and therefore not observed by the learner. 


In this setting, each arm has an underlying unknown but constant threshold. When the arm
is sampled, the reward is either 1 or 0, depending on whether the sample surpasses the
threshold or not. We refer to this type of censoring as “self-censored” since the censoring
thresholds are fixed and are independent of the specific policy of the learner.


Abernethy et al. discuss two variants for this setting: one with feedback on the sample
and one with feedback only on samples that surpass the threshold. The authors propose
an algorithm to address each variant: the Dvoretzky-Kiefer-Wolfowitz Inequality based
Upper Confidence Bound (DKWUCB) algorithm for the variant with feedback and the
Kaplan-Meier based Upper Confidence Bound (KMUCB) algorithm for the variant without
feedback. Both algorithms are shown to have a regret that is of order $O(\log T)$ [6].


2.3.3 Budget-Constraint MAB 
Zhou and Tomlin tackled a budget-constraint variant of MAB, where multiple plays are
allowed at each time step [12]. Their approach focuses on the combinatorial aspects of
choosing which arms to play, resulting in a time horizon-independent regret.


The setting proposed by Zhou and Tomlin uses a fixed budget $0 < B \in \mathbb{R}$, to be
allocated across the entire time horizon. Moreover, the time horizon $T$ depends on the
specific allocation of the budget, and the algorithm terminates when the budget runs out.
Consequently, while we can draw inspiration from the algorithm suggested in [12], we believe
its applicability to an intelligence-collection setting is limited, as time is a critical
factor in our setting.


We note that the algorithm, called the Upper Confidence Bound for Multi-Play with Budget
(UCBMB) algorithm, is a budget-aware version of UCB that plays a fixed number $L$ of
arms with the highest UCB indices at each iteration. The regret of UCBMB is
shown to be of order $O(K L^4 \log B)$.


2.4 Intelligence-collection Using MAB 

We end this chapter by briefly discussing how the different MAB variants (summarized
in Table 2.1) can contribute to our model. The MAB variants we explored introduce new
elements to the classical MAB: a continuous decision space, censoring of information
and budget constraints. We wish to incorporate all of these elements in some way into our
model.


As noted in Chapter 1, we aim to assist decision makers in resource allocation when
the intelligence collection plan is created. As such, incorporating a budget constraint is
desirable. However, we note that our model should still be limited by a time horizon,
as intelligence collection is a time-critical procedure.


The continuous decision space (Continuum MAB), as opposed to the discrete decision
space used in classical MAB, better reflects the trade-offs a decision maker faces when
multiple options are present. In other words, there is a continuum of intelligence
collection opportunities, rather than a discrete set of opportunities.


Importantly, we can use censoring of rewards from the decision maker to represent an
interaction between Red and Blue actions. That is, Blue needs to invest resources in order
to stand a chance to receive a reward. The randomness of each intelligence source (arm) in
the classical MAB captures the variety of information Red keeps in each individual source.
The censoring allows us to model potential obscurity of information Blue can receive from
a given source, if they do not allocate sufficient resources for exploring that source.


By combining these three elements into our proposed model, we will be able to better express
the complexities of resource allocation for intelligence collection. From an algorithmic
perspective, to the best of our knowledge, addressing such a model with all of the above
elements has not been attempted before.

Table 2.1. Summary of MAB Algorithms

Problem Property            Algorithm                    Assumptions        Performance
Classical MAB               UCB [5]                      None               $O(\log T)$
Continuum MAB               Kleinberg’s Algorithm [10]   $K = 2$            $O(T^{2/3})$
Continuum MAB               Zooming Algorithm [11]       Lipschitz rewards  $O(\log T \cdot T^{\frac{K+1}{K+2}})$
Continuum MAB               FKM Algorithm [7]            None               $O(T^{3/4})$
Censored with Feedback      DKWUCB Algorithm [6]         Self-censored      $O(\log T)$
Censored without Feedback   KMUCB Algorithm [6]          Self-censored      $O(\log T)$
Budget-Dependent Horizon    UCBMB [12]                   Time-independent   $O(K L^4 \log B)$


2.4.1 Additional Related Work 
While the elements introduced in our model are novel, we acknowledge that an MAB or
Online Optimization-inspired setting for intelligence collection has been previously devised.

We specifically pay attention to the works in [3] and [4]. The work in [3] is specifically
tailored for intelligence collection in the cyber domain and incorporates domain-specific
elements into the model, such as the network structure from which information is extracted.
In [4], the authors specifically address the process of intelligence collection including
extraction of information, processing and analysis. As described in Chapter 3, we abstract
this chain into a single step in our model that captures the collection, processing and analysis
(evaluation) of information. We further note that the models presented in these works make
some specific assumptions regarding the distributions involved. We will refrain from making
any specific assumptions regarding any distributions.

CHAPTER 3:
Proposed Model and Algorithm





In this chapter, we present a mathematical formulation of our model and discuss the
implications of introducing multiple mechanics into our model.


3.1 Model Formulation 

A decision maker is provided with a set of $K \in \mathbb{N}$ intelligence sources and a
time horizon $T \in \mathbb{N}$. During each time step $t = 1, 2, \ldots, T$, each
intelligence source $k \in \{1, \ldots, K\}$ can provide a piece of information of some
value. We model the value of that information using a random variable $X_k$. For all
$k \in [K]$, $X_k$ are independent random variables with support
$\mathcal{I} \subset \mathbb{R}$. We also refer to $\mathcal{I}$ as the information space.


To collect information from a given source, the decision maker must pay a cost. For each
$k \in [K]$, we assume the existence of an information-cost function,
$C_k : \mathcal{I} \to \mathbb{R}_+$. Paying a cost $C_k(x)$ guarantees that all
information with value at most $x \in \mathcal{I}$ will be accessible to the decision maker
from source $k$. Since the value of information is random, if the value generated from
source $k$ is greater than $x$, the decision maker will not be able to observe it due to
insufficient resources. In other words, the value observed by the decision maker
becomes 0.


The decision maker is limited by a budget, $B \in \mathbb{R}_+$, for each time step
$t = 1, \ldots, T$. The budget can be allocated freely between any number of intelligence
sources, and any portion of the budget not spent in a specific time step is lost.


In this model, we use the expected observable value of a source. It is the expected value
of information up to a certain specified value within $\mathcal{I}$, referred to as the
information threshold. The expected observable value expresses the expected value
accessible to the decision maker. Formally, the expected observable value from a source
$k \in [K]$ at time $t$, given an information value threshold $x \in \mathcal{I}$, is
expressed through

$$ \mathbb{E}(X_k; X_k \le x) = \int_{z \in \mathcal{I}} z f_k(z) \mathbb{I}(z \le x) \, dz, \tag{3.1} $$

where $f_k$ is the probability density function of intelligence source $k$ at all times
$t = 1, \ldots, T$, and $\mathbb{I}$ is an indicator function. The goal of the decision
maker is to maximize the total expected observable value of all sources given the budget
$B$ in each time step $t = 1, \ldots, T$,

$$ \max_{x \in \mathcal{I}^K} \sum_{k=1}^{K} \mathbb{E}(X_k; X_k \le x_k) \quad \text{subject to} \quad \sum_{k=1}^{K} C_k(x_k) \le B, \tag{3.2} $$

where $x_k$ denotes the $k$-th coordinate of a $K$-dimensional vector $x \in \mathcal{I}^K$.


3.1.1 Connection to the Online Optimization and MAB Frameworks 
Equation (3.2) describes the goal of the decision maker at each time step
$t \in \{1, 2, \ldots, T\}$. In practice, while the functions $C_k$ are known to the
decision maker, the distribution of each $X_k$ (and consequently, the structure of the
function $\mathbb{E}(X_k; X_k \le x)$) is not available.


We can recast the formulation as an online optimization problem. At each time step $t$, the
decision maker decides on an allocation action $x_t \in \mathcal{I}^K$, where $x_t$ is a
$K$-dimensional vector and $\mathcal{I}^K$ denotes the $K$-dimensional space whose
coordinates all lie in $\mathcal{I}$. The learner then receives feedback of the form
$f_t(x_t) = \sum_{k=1}^{K} X_{k,t} \mathbb{I}(X_{k,t} \le x_k)$, where $X_{k,t}$ is a sample
of the value of information of source $k$ at time step $t$ and $x_k$ is the $k$-th
coordinate of $x_t$. Observe that, by definition, we have
$\mathbb{E}(f_t(x_t)) = \sum_{k=1}^{K} \mathbb{E}(X_k; X_k \le x_k)$, which is exactly our
objective function from equation (3.2). Maximizing the expected value of $f_t$ is therefore
equivalent to minimizing the regret of the online learning problem induced by $f_t$, as
defined by equation (2.1). For completeness, note that the set of feasible policies $S$ is
given by

$$ S = \left\{ x \in \mathcal{I}^K \;\middle|\; \sum_{k=1}^{K} C_k(x_k) \le B \right\}. \tag{3.3} $$


The shape and properties of the set $S$ depend on the specific information-cost function,
as discussed below. We note that in most cases, however, it is a set defined within the
continuous domain. Consequently, when casting the formulation in equation (3.2) to the
MAB framework, we must use a continuous decision space, such as the one described as
Continuum MAB in Chapter 2.


Otherwise, the MAB interpretation of the formulation is straightforward: in each time step,
a decision maker allocates a budget $B$ to a set of $K$ arms based on past observations
from each arm. If the cost paid for a certain arm is insufficient for its produced value,
that is, $\mathbb{I}(X_{k,t} \le x_k) = 0$, the value is never observed by the decision maker.
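The censored feedback described in this section can be made concrete with a short sketch. The function names `observe` and `censored_reward` are hypothetical, and the inputs are plain lists holding one threshold and one sampled value per source.

```python
def observe(thresholds, samples):
    """Per-source observed values: the sample if it is within the threshold, else 0."""
    return [value if value <= threshold else 0.0
            for value, threshold in zip(samples, thresholds)]


def censored_reward(thresholds, samples):
    """Total feedback: the sum over sources of X_k * I(X_k <= x_k)."""
    return sum(observe(thresholds, samples))


# Two sources with thresholds 0.6 and 0.4; the second sample (0.9) is censored.
print(observe([0.6, 0.4], [0.5, 0.9]))
print(censored_reward([0.6, 0.4], [0.5, 0.9]))
```

Note that a censored sample and a genuine sample of value zero are indistinguishable in the total reward, which is part of what makes estimation in this setting difficult.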


3.2 The Information-cost Function 

The information-cost function is assumed to be non-decreasing, that is, the decision maker
must pay more in order to observe higher values of information. If the information-cost
function is invertible within $\mathcal{I}$, we can transform the optimization problem in
equation (3.2) from an information-space problem to a cost-space problem:

$$ \max_{c \in \mathbb{R}_+^K} \sum_{k=1}^{K} \mathbb{E}(X_k; X_k \le C_k^{-1}(c_k)) \quad \text{subject to} \quad \sum_{k=1}^{K} c_k \le B. \tag{3.4} $$

One potential advantage of this cost-space formulation is that the constraint is linear in
the decision variables. In fact, by denoting $y = c/B$ we can rewrite the formulation as an
optimization problem over the $K$-dimensional simplex $\Delta_K$:

$$ \max_{y \in \Delta_K} \sum_{k=1}^{K} \mathbb{E}(X_k; X_k \le C_k^{-1}(B y_k)), \tag{3.5} $$

where $y_k$ is the $k$-th coordinate of the vector $y$. Moreover, we are specifically
interested in affine information-cost functions. If we let $C_k(z) = a_k z + b_k$, then we
can recast the inequality $X_k \le C_k^{-1}(B y_k)$ as $X_k \le (B y_k - b_k)/a_k$, and we
can once more rewrite the formulation as

$$ \max_{y \in \Delta_K} \sum_{k=1}^{K} \left[ \frac{B}{a_k} \mathbb{E}(Y_k; Y_k \le y_k) - \frac{b_k}{a_k} P(Y_k \le y_k) \right], \tag{3.6} $$

where $Y_k = (a_k X_k + b_k)/B$. The above formulation benefits from a simple constraint set
(the decision space is the $K$-dimensional simplex) and a relatively simple form for the
objective function that depends directly on the censored mean function of the random
variables $\{Y_k\}_{k=1}^{K}$.
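The passage from the affine cost function to the formulation in terms of $Y_k$ can be spelled out term by term. The following is a sketch under the additional assumption, not stated explicitly above, that $a_k > 0$, so that $C_k$ is invertible and the direction of the inequality is preserved.

```latex
\begin{align*}
X_k \le C_k^{-1}(B y_k) = \frac{B y_k - b_k}{a_k}
  &\iff \frac{a_k X_k + b_k}{B} \le y_k
  \iff Y_k \le y_k, \\
X_k &= \frac{B Y_k - b_k}{a_k}, \\
\mathbb{E}\bigl(X_k;\, X_k \le C_k^{-1}(B y_k)\bigr)
  &= \mathbb{E}\Bigl(\tfrac{B Y_k - b_k}{a_k};\, Y_k \le y_k\Bigr)
  = \frac{B}{a_k}\,\mathbb{E}(Y_k;\, Y_k \le y_k) - \frac{b_k}{a_k}\,P(Y_k \le y_k).
\end{align*}
```

Summing the last line over $k = 1, \ldots, K$ gives the objective over the simplex. In the special case $a_k = 1$, $b_k = 0$, $B = 1$ the second term vanishes and $Y_k = X_k$.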


3.2.1 Assumptions In This Work 

Unless noted otherwise, we shall hereafter assume that $\mathcal{I} = [0, 1]$, $B = 1$ and
$C_k(x) = x$ for all $k \in [K]$. These assumptions allow us to better analyze the
properties and algorithms of the model, and the results generalize in a standard way to any
bounded interval. The problem, given these assumptions, is

$$ \max_{x \in \Delta_K} \sum_{k=1}^{K} \mathbb{E}(X_k; X_k \le x_k). \tag{3.7} $$


We note that since the information-cost function is the identity function, we can say that
the decision maker operates within the cost space or the information value space. We will
therefore refer hereafter to $x_k$ as the cost paid to gather intelligence from source $k$.


3.3. Estimating the Objective Function 

A unique challenge for this model is that the decision maker can make a decision resulting
in no new information at all. As a result, estimating the censored mean of each intelligence
source does not only require the decision maker to allocate resources to spend on a certain
source, but also requires the decision maker to allocate sufficient resources. This section
discusses how we can estimate the censored mean of a source based on the policies employed
by the decision maker.


To simplify our notation, we will consider how to estimate the expected observable value
of a single source whose value is a random variable $X$ with support on $[0, 1]$. We
consider a series of $n$ collection attempts, designed in two fashions: one uses the
entirety of the budget on a single source, the other splits the budget between two sources.
These two types of policies shed light on the limits of different sampling strategies, and
can be generalized to address an arbitrary number of sources to focus upon.

3.3.1 Full-cost Policy Estimator

Consider this case: we use the entire budget on a single intelligence source throughout all
$n$ attempts. Since the budget $B = 1 \ge \max_{v \in \mathcal{I}} v$, all $n$ attempts are
uncensored and completely observable by the decision maker. To simplify our notation, for
$x \in [0, 1]$ denote

$$ \mu(x) = \mathbb{E}(X; X \le x). \tag{3.8} $$


$\mu(x)$ is the true expected observable value for some threshold $x \in [0, 1]$. Using the
set of samples $\{x_i\}_{i=1}^{n}$, we can construct an estimator for $\mu(x)$ via

$$ \hat{\mu}(x; n) = \frac{1}{n} \sum_{i=1}^{n} x_i \mathbb{I}(x_i \le x). \tag{3.9} $$

We refer to $\hat{\mu}(x)$ as an empirical censored mean function at point $x \in [0, 1]$.
Note that the indicator expression $\mathbb{I}(x_i \le x)$ nullifies the $i$-th sample if
its value is more than the parameter $x$. If $x = 0$, $\hat{\mu}(x; n) = 0$, whereas if
$x = 1$, $\hat{\mu}(x; n)$ is simply the sample mean of all uncensored samples.
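The empirical censored mean function can be computed directly from a list of uncensored samples. The sketch below is a minimal illustration; the function name `empirical_censored_mean` and the sample values are hypothetical.

```python
def empirical_censored_mean(samples, x):
    """(1/n) * sum_i x_i * I(x_i <= x), where n counts all samples, kept or nullified."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    return sum(s for s in samples if s <= x) / n


samples = [0.2, 0.5, 0.9, 0.4]
for threshold in (0.0, 0.5, 1.0):
    print(threshold, empirical_censored_mean(samples, threshold))
```

The divisor is always the total number of attempts, so the function is non-decreasing in the threshold and coincides with the ordinary sample mean at the top of the interval.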


Since we do not have access to the censored mean function directly in our setting, the
empirical censored mean function is critical to our ability to estimate the objective function.
The challenge in evaluating $\mu(x)$ is that samples of $X$ might be censored if the decision
maker does not allocate enough resources to observe them. Consequently, even if we attempt
to sample a certain source $n$ times, we may end up with fewer usable samples.

How good is this proposed approximation of the censored mean function? The following
lemma addresses this exact question.


Lemma 3.3.1 For any $\delta > 0$, given $n$ IID uncensored samples of a source:

$$ P\left( \sup_{x \in [0,1]} |\hat{\mu}(x; n) - \mu(x)| > \delta \right) \le 4 e^{-2 n \delta^2}. \tag{3.10} $$


The proof of this lemma is given in Appendix A.1. The above result implies that the
probability that the error exceeds a given threshold decays exponentially with the number
of samples. Also, the error threshold is controlled by an observer and can be independent
of the distribution of the intelligence sources. This strength of the proposed estimator
also has an inherent weakness: Lemma 3.3.1 requires $n$ uncensored samples from a single
source. If there are $K$ different sources, we need to collect $Kn$ uncensored samples in
order to achieve the same level of error for all of them. In other words, the cost is linear
in the number of intelligence sources.


3.3.2 Twin-cost Policy Estimator 

In the previous section, the simplicity of the full-cost policy estimator was apparent due to 
the bypassing of the censoring altogether. In this section, we wish to expand the full-cost 
policy estimator. 


A natural expansion to the full-cost policy estimator is dubbed the twin-cost policy estimator.
We derive this estimator for the case where $K = 2$. Here, we pay a fixed cost
$0 < \beta < 1$ for one source, and $1 - \beta$ for the other source. The twin-cost policy
estimator allows the learner to sample the simplex space $\Delta_2$ at the 2-dimensional
points $(\beta, 1 - \beta)^T$ and $(1 - \beta, \beta)^T$.


For a specific source, assume the learner pays 6 exactly ng times and pays | — £ exactly 
ni_g times. Without loss of generality, assume that 6 < 1/2 < 1 — £. In order to estimate 
the expected observable value of the random variable X at some point x € [0,1], we have 
to consider the exact point of evaluation. If x > 1 — 6, then x is beyond any observable 
sample and we cannot tell anything about the expected observable value. Conversely, if 
x < B, then all uncensored samples from amongst the ng + n-g collection attempts can be 
used to estimate the expected observable value in a similar manner to the one presented in 


subsection 3.3.1. 


The case where $x \in (\beta, 1 - \beta]$ is the most complex. We must carefully consider how to use the uncensored samples collected. To do so, observe that for $x \in (\beta, 1 - \beta]$ we have

$$\mu(x) = \int_0^x z f_X(z)\,dz = \int_0^\beta z f_X(z)\,dz + \int_\beta^x z f_X(z)\,dz = \mu(\beta) + \underbrace{\int_\beta^x z f_X(z)\,dz}_{\text{residual}}. \tag{3.11}$$

Since all samples less than or equal to $\beta$ are uncensored, we can use all of the $n_\beta + n_{1-\beta}$ samples to estimate $\mu(\beta)$, then use the remaining uncensored samples to estimate the residual.


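To formalize this observation, let $U_\beta$ and $U_{1-\beta}$ denote the subsets of uncensored samples for each cost. We denote the proposed estimator by $\hat{\mu}(x; \beta, n_\beta, n_{1-\beta})$,

$$\hat{\mu}(x; \beta, n_\beta, n_{1-\beta}) = \begin{cases} \hat{\mu}(x; n_\beta + n_{1-\beta}) & x \le \beta \\ \hat{\mu}(\beta; n_\beta + n_{1-\beta}) + \frac{1}{n_{1-\beta}} \sum_{z \in U_{1-\beta}} z\, \mathbb{I}(z \in (\beta, x]) & \beta < x \le 1 - \beta \\ 0 & x > 1 - \beta. \end{cases} \tag{3.12}$$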


The accuracy of this estimator depends heavily on the choice of $\beta$ and the cumulative distribution function of $X$. We do not analyze the estimator's performance analytically, but we explore it numerically in Chapter 4.
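The three cases of the twin-cost policy estimator can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in our experiments: it assumes the pooled term is normalized by the total number of attempts $n_\beta + n_{1-\beta}$ and the residual term by $n_{1-\beta}$, and the function and argument names are hypothetical.

```python
def twin_cost_estimate(x, beta, obs_beta, n_beta, obs_high, n_high):
    """Twin-cost estimate of the censored mean at x.

    obs_beta: uncensored values observed when paying beta (all <= beta)
    n_beta:   number of attempts made at cost beta
    obs_high: uncensored values observed when paying 1 - beta
    n_high:   number of attempts made at cost 1 - beta
    The normalizations below are assumptions of this sketch.
    """
    if not 0.0 < beta < 0.5:
        raise ValueError("sketch assumes 0 < beta < 1/2")
    n_total = n_beta + n_high
    if n_total == 0:
        raise ValueError("need at least one collection attempt")
    if x > 1.0 - beta:
        return 0.0  # beyond any observable sample
    # Below beta every attempt, at either cost, is informative.
    cutoff = min(x, beta)
    pooled = sum(z for z in list(obs_beta) + list(obs_high) if z <= cutoff)
    estimate = pooled / n_total
    if x > beta and n_high > 0:
        # Residual on (beta, x] is only visible to the 1 - beta attempts.
        estimate += sum(z for z in obs_high if beta < z <= x) / n_high
    return estimate
```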


3.3.3 Beyond Full-cost and Twin-cost Policy Estimators 

The strength of both the full-cost policy estimator and the twin-cost policy estimator stems from their simplicity. However, we note that more complex and involved methods exist, such as control variates estimators [13] or the Kaplan-Meier survival analysis method [14]. We leave these estimators for future work on the subject.


3.4 From Model to Algorithm 


Equation (3.7) defines the optimization problem we wish to solve using online optimization and MAB techniques. In this section, we explore simple algorithms that address this optimization problem. Some algorithms in this section use an estimator function, as discussed in section 3.3. We denote the estimator at time step $t \in \{1, \ldots, T\}$ by $F_t(x)$. The performance of the algorithm depends greatly on the performance of $F_t(x)$, as we shall discuss in this section.

3.4.1 The (K + 1)-UCB Algorithm

We begin by introducing the (K + 1)-UCB algorithm. Inspired by the original UCB (see Algorithm 1), this algorithm exploits the fact that $B = \max_{v \in V} v = 1$ by sampling the $K$ corners of the $K$-dimensional simplex, denoted by $e_k$ for $k \in [K]$. In other words, in the exploration rounds the entire budget is allocated to collecting information from a single source at each time step. Importantly, since $B = 1$, no sample in this approach is ever censored, thereby nullifying the censoring effect.


The exploitation rounds make use of an "extra" sampling point (hence the +1 in the algorithm's name) that uses all samples collected so far and an estimator $F_t(x)$ to estimate the maximal value of the objective function on the $K$-dimensional simplex. Since no samples are censored, the estimator $F_t(x)$ is the sum of the empirical censored mean functions introduced in subsection 3.3.1,

$$F_t(x) = \sum_{k=1}^{K} \hat{\mu}_{k,t}(x_k; N_{k,t}), \tag{3.13}$$

where $N_{k,t}$ is the number of samples collected from source $k \in [K]$ up until time $t$, and $\hat{\mu}_{k,t}(x; N_{k,t})$ is the empirical censored mean of all samples collected from source $k \in [K]$ up until time $t$, as defined via equation (3.9). The algorithm is shown below as Algorithm 4. The total value of information collected at time step $t$ is given by $\sum_{k=1}^{K} X_{k,t}\, \mathbb{I}(X_{k,t} \le x_{k,t})$. Note that $X_{k,t}\, \mathbb{I}(X_{k,t} \le x_{k,t})$ can be zero if the value of information is higher than $x_{k,t}$. However, during exploration, since $B = 1$ and $x_{k,t} \ne 0$ for exactly one intelligence source $k$, no censoring can occur.


The (K + 1)-UCB algorithm is a simple approach that eliminates the complex elements introduced in our model. However, it is not very effective from a practical standpoint, since allocating the entire budget to a single source at a time step is a potentially wasteful policy. Nevertheless, this algorithm is attractive to introduce because it serves as a baseline, and its performance is analytically tractable.





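Algorithm 4: (K + 1)-UCB
Input: $K \in \mathbb{N}$ (number of sources)
       $T \in \mathbb{N}$ (time horizon)
       $B = 1$ (budget)
       $\xi \in \mathbb{R}$
Output: $\pi_{(K+1)\text{-UCB}} = \{x_t\}_{t=1}^{T}$ ((K + 1)-UCB policy)
$\forall k = 1, \ldots, K$: $\mathrm{UCB}_{k,0} \leftarrow \infty$, $N_{k,0} \leftarrow 0$, $\hat{\mu}_{k,0} \leftarrow 0$
$\mathrm{UCB}_{(K+1),0} \leftarrow 0$
for $t = 1, 2, \ldots, T$ do
    $k_t \leftarrow \arg\max_{k = 1, \ldots, K, K+1} \mathrm{UCB}_{k,t-1}$ (choose source to explore or mixture to exploit)
    $x_t \leftarrow e_{k_t}$ if $k_t \ne K + 1$; $x_t \leftarrow \arg\max_{x \in \Delta^K} F_t(x)$ if $k_t = K + 1$
    $y_t \leftarrow \sum_{k=1}^{K} X_{k,t}\, \mathbb{I}(X_{k,t} \le x_{k,t})$ (play arm $k_t$)
    if $k_t \ne K + 1$ then
        $N_{k_t,t} \leftarrow N_{k_t,t-1} + 1$
        $\hat{\mu}_{k_t,t} \leftarrow (N_{k_t,t-1}\, \hat{\mu}_{k_t,t-1} + y_t) / N_{k_t,t}$
        $\mathrm{UCB}_{k_t,t} \leftarrow \hat{\mu}_{k_t,t} + \sqrt{\xi t^{2/3} / N_{k_t,t}}$ (update UCB index according to the index type)
    end
    $\mathrm{UCB}_{(K+1),t} \leftarrow \max_{x \in \Delta^K} F_t(x)$ (optimize the estimator)
end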
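The loop of Algorithm 4 can be sketched in Python as below. This is an illustrative sketch, not our experimental code: the exact maximization over $\Delta^K$ is replaced by a search over a finite, caller-supplied list of candidate allocations, and `draw`, `candidates`, and `uniform_sources` are hypothetical names introduced only for this example.

```python
import math
import random


def censored_mean(samples, x):
    """Empirical censored mean: samples above x contribute zero."""
    if not samples:
        return 0.0
    return sum(s for s in samples if s <= x) / len(samples)


def uniform_sources(seed):
    """Hypothetical test sources: every source is Uniform[0, 1]."""
    rng = random.Random(seed)
    return lambda k: rng.random()


def k_plus_one_ucb(draw, K, T, xi, candidates):
    """Run T rounds; draw(k) returns one value in [0, 1] from source k."""
    samples = [[] for _ in range(K)]   # uncensored samples from corner plays
    ucb = [math.inf] * K + [0.0]       # K corner indices, then the +1 arm
    best = list(candidates[0])

    def estimate(x):
        # Estimator (3.13): sum of per-source empirical censored means.
        return sum(censored_mean(samples[k], x[k]) for k in range(K))

    policy, rewards = [], []
    for t in range(1, T + 1):
        k_t = max(range(K + 1), key=lambda k: ucb[k])
        if k_t < K:
            x_t = [1.0 if j == k_t else 0.0 for j in range(K)]  # corner e_k
        else:
            x_t = list(best)                                    # exploit
        values = [draw(k) for k in range(K)]
        # A value is observed only if it does not exceed the cost paid.
        y_t = sum(v for v, cost in zip(values, x_t) if v <= cost)
        if k_t < K:
            samples[k_t].append(values[k_t])
            n = len(samples[k_t])
            ucb[k_t] = sum(samples[k_t]) / n + math.sqrt(xi * t ** (2 / 3) / n)
        best = list(max(candidates, key=estimate))  # stand-in for the argmax
        ucb[K] = estimate(best)
        policy.append(x_t)
        rewards.append(y_t)
    return policy, rewards
```

For $K = 2$, a grid such as `[(j / 20, 1 - j / 20) for j in range(21)]` can serve as the candidate list.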











Algorithm 4 balances exploration and exploitation differently than the original UCB algorithm (Algorithm 1). The algorithm performs exploration steps by sampling the $K$ corner points of the $K$-dimensional simplex $\Delta^K$, each corresponding to a full exploration of a different intelligence source.

Using the information collected through exploration, the algorithm uses the estimator (3.13) to find a potential exploitation point. The exploitation point divides the budget $B$ between the intelligence sources in a manner that is deemed optimal given the current information. Consequently, the point of exploitation is in fact a mixture of collection attempts from different sources, and is not limited to a single source. The next section details how to uncover this exploitation point.

3.4.2 Optimizing the Estimator

As noted in Algorithm 4, a critical step in the implementation requires an optimization of the estimator $F_t(x)$. In this subsection, we discuss several techniques that can be used to implement this step. We draw inspiration from techniques discussed in Chapter 2.

First, note that if $K = 2$, we can use an exhaustive search to find the value of $x \in [0,1]$ that maximizes $F_t(x)$. While accurate, this approach is not the most efficient. In fact, for $K > 2$, this approach becomes infeasible in terms of the computational power required.

Our go-to approach is therefore to employ the FKM algorithm (Algorithm 2). We use our estimator of the objective function $F_t(x)$ to perform a few offline gradient steps, typically 5. We start our search from the center of the unit simplex, whose coordinates are all $1/K$. Following each gradient ascent step, we project the resultant vector onto the unit simplex using the method described in [8].
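The following Python sketch illustrates the project-after-each-step structure described above. It makes two substitutions that should be read as assumptions: the projection is the standard sort-based Euclidean projection onto the simplex, which may differ from the method in [8], and the gradient is a deterministic central finite difference rather than the randomized one-point estimate used by FKM. All names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x_k >= 0, sum_k x_k = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        cumulative += u_i
        candidate = (cumulative - 1.0) / i
        if u_i - candidate > 0:
            theta = candidate  # keep the shift from the last valid prefix
    return [max(x - theta, 0.0) for x in v]


def projected_ascent(objective, K, steps=5, step_size=0.1, h=1e-3):
    """A few gradient ascent steps from the simplex center, projecting each time."""
    x = [1.0 / K] * K
    for _ in range(steps):
        grad = []
        for k in range(K):
            up, down = x[:], x[:]
            up[k] += h
            down[k] -= h
            grad.append((objective(up) - objective(down)) / (2.0 * h))
        x = project_to_simplex([x_k + step_size * g for x_k, g in zip(x, grad)])
    return x
```

Because the empirical censored mean is piecewise constant in $x$, the difference width `h` would need to be large relative to the spacing of the collected samples for the finite difference to carry any signal.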


3.4.3 (K + 1)-UCB Performance
In this section, we discuss the performance of the (K + 1)-UCB algorithm. For our analysis, we use a technique similar to the one presented in [6]. The main result is stated below in lemma 3.4.1:

Lemma 3.4.1 The regret $r_T(\pi_{(K+1)\text{-UCB}})$ resulting from Algorithm 4 using a perfect oracle to solve $\max_{x \in \Delta^K} F_t(x)$ is of order $O(T^{2/3} \sqrt{\log T})$.

In what follows, we argue why the above result holds.


Notations 
We begin by introducing several notations used throughout this subsection. First, the notation $\hat{\mu}_{k,t}$ from Algorithm 4 satisfies

$$\hat{\mu}_{k,t} = \hat{\mu}_{k,t}(1; N_{k,t}), \tag{3.14}$$

and we shall use them interchangeably in this section. We also omit the indication of $N_{k,t}$ throughout.


Denote by $x^* \in \Delta^K$ the optimal solution of the formulation in equation (3.7), and let $\hat{x}_t$ be the optimal solution of the estimator at time $t$. We also denote the $j$-th coordinate of $\hat{x}_t$ by $\hat{x}_{j,t}$.

For $k \in [K]$, we let $\epsilon_k$ be the quantity

$$\epsilon_k = \sum_{j=1}^{K} \mu_j(x_j^*) - \mu_k(1), \tag{3.15}$$

that is, $\epsilon_k$ is the difference between the value at the optimal solution and the value at the $k$-th corner point of $\Delta^K$ (called $e_k$).

We also let $\delta_{k,t}$ be a time-dependent per-arm bound on the estimation error. We use $\delta_{k,t}$ to simplify our notation when applying lemma 3.3.1.


Preliminary Analysis - Bounding $N_{k,t}$
The regret is determined by the number of times an arm $k \in [K]$ is played, $N_{k,T}$. We therefore focus our attention on deriving an upper bound for $N_{k,T}$ as a function of $T$. In what follows, we derive a bound on $N_{k,t}$ under certain probabilistic conditions.


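The algorithm plays some arm $k \in [K]$ at time step $t$ if and only if:

$$\begin{aligned} &(1)\ \forall j \ne k:\ \mathrm{UCB}_{k,t} \ge \mathrm{UCB}_{j,t}, \\ &(2)\ \mathrm{UCB}_{k,t} \ge \max_{x \in \Delta^K} \sum_{j \in [K]} \hat{\mu}_{j,t}(x_j). \end{aligned} \tag{3.16}$$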


Consider the case where the respective arm empirical means are close to the true arm means, that is,

$$\forall k \in [K],\ \forall x \in [0,1]:\ |\hat{\mu}_{k,t}(x) - \mu_k(x)| \le \delta_{k,t}. \tag{3.17}$$

Using the inequalities (3.16), we have in this case

$$\hat{\mu}_{k,t}(1) + \sqrt{\xi t^{2/3} / N_{k,t}} \ge \sum_{j=1}^{K} \hat{\mu}_{j,t}(\hat{x}_{j,t}) \ge \sum_{j=1}^{K} \hat{\mu}_{j,t}(x_j^*), \tag{3.18}$$

where we used the fact that the maximizer of the right-hand side of inequality (2) in (3.16) is at least as good a candidate for optimizing the estimator of the objective function as the true maximizer of the actual objective function. We can further reduce this inequality to

$$\mu_k(1) + \delta_{k,t} + \sqrt{\xi t^{2/3} / N_{k,t}} \ge \sum_{j=1}^{K} \left( \mu_j(x_j^*) - \delta_{j,t} \right), \tag{3.19}$$

or, simply,

$$\delta_{k,t} + \sum_{j=1}^{K} \delta_{j,t} + \sqrt{\xi t^{2/3} / N_{k,t}} \ge \sum_{j=1}^{K} \mu_j(x_j^*) - \mu_k(1) = \epsilon_k. \tag{3.20}$$

Observe that the right-hand side of inequality (3.20) is non-negative, since it is the difference between the value at the maximizer $x^*$ and the value at the $k$-th corner point.


Set $\delta_{k,t} = \sqrt{\xi t^{2/3} / N_{k,t}}$ for all $k \in [K]$. The right-hand side of the above inequality is non-negative, whereas the left-hand side decays as $t \to \infty$. This results in the following inequality:

$$3 \sqrt{\xi t^{2/3} / N_{k,t}} + \sum_{j=1, j \ne k}^{K} \sqrt{\xi t^{2/3} / N_{j,t}} \ge \epsilon_k. \tag{3.21}$$

Inequality (3.21) holds with the stated probability not only for a single source $k$, but for all sources $k \in [K]$. We thus have a system of linear inequalities.


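As $t \to \infty$, these inequalities become equations, since the left-hand side decays while the right-hand side is a non-negative constant. Let $g(t) \in \mathbb{R}^K$ be a vector whose $k$-th coordinate is $\sqrt{\xi t^{2/3} / N_{k,t}}$, let $e \in \mathbb{R}^K$ be a vector whose coordinates are $\epsilon_k$, and consider the matrix $A_K \in \mathbb{R}^{K \times K}$,

$$A_K = \begin{pmatrix} 3 & 1 & \cdots & 1 \\ 1 & 3 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 3 \end{pmatrix}. \tag{3.22}$$

The system derived from (3.21) can be recast as $A_K g(t) = e$. The matrix $A_K$ is invertible, and its inverse is given by

$$A_K^{-1} = \frac{1}{2(K+2)} \begin{pmatrix} K+1 & -1 & \cdots & -1 \\ -1 & K+1 & \cdots & -1 \\ \vdots & \vdots & \ddots & \vdots \\ -1 & -1 & \cdots & K+1 \end{pmatrix}, \tag{3.23}$$

implying its structure depends only on $K$. We can solve directly to find that

$$\sqrt{\xi t^{2/3} / N_{k,t}} = \frac{1}{2} \left( \epsilon_k - \frac{1}{K+2} \sum_{j=1}^{K} \epsilon_j \right). \tag{3.24}$$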
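As a quick numerical sanity check of this step, the sketch below builds the coefficient matrix implied by (3.21), with 3 on the diagonal and 1 elsewhere, multiplies it by the claimed closed-form inverse, and confirms that the closed-form solution for $g(t)$ solves the system. The helper names are hypothetical, and the check uses plain Python lists rather than a linear algebra library.

```python
def a_matrix(K):
    """Coefficient matrix of the system from (3.21): 3 on the diagonal, 1 elsewhere."""
    return [[3.0 if i == j else 1.0 for j in range(K)] for i in range(K)]


def a_inverse(K):
    """Claimed inverse: (K + 1) on the diagonal, -1 elsewhere, scaled by 1 / (2 (K + 2))."""
    scale = 1.0 / (2.0 * (K + 2))
    return [[scale * (K + 1) if i == j else -scale for j in range(K)]
            for i in range(K)]


def matmul(A, B):
    """Plain triple-loop matrix product for square lists of lists."""
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]


def solve_g(eps):
    """Closed-form solution g_k = (eps_k - sum(eps) / (K + 2)) / 2."""
    K = len(eps)
    shift = sum(eps) / (K + 2)
    return [0.5 * (e - shift) for e in eps]
```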


We can express $N_{k,t}$ as a function of $t$ at the time of equality, which gives a bound on the number of times a specific source is sampled individually:

$$N_{k,t} \le \underbrace{\frac{4\xi}{\left( \epsilon_k - \frac{1}{K+2} \sum_{j=1}^{K} \epsilon_j \right)^2}}_{\rho_k}\, t^{2/3}, \tag{3.25}$$

where $\rho_k$ is a problem-dependent constant, implying $N_{k,t} = O(t^{2/3})$ for all $k \in [K]$. In other words, $N_{k,T} = O(T^{2/3})$.

Regret Analysis for a Corner Point

We now proceed to bound the regret resulting from playing the corner points of $\Delta^K$. Using inequality (3.25), we can bound the regret in the case where $|\hat{\mu}_{k,t}(x) - \mu_k(x)| \le \delta_{k,t}$ for all $k \in [K]$.

To use this bound, we write the total regret from sampling a corner point $k$ as a decomposition into three cases:

• The case where $|\hat{\mu}_{j,t}(1) - \mu_j(1)| \le \delta_{j,t}$ for all $j \in [K]$.
• The case where $|\hat{\mu}_{k,t}(1) - \mu_k(1)| > \delta_{k,t}$ and $N_{k,t} > \rho_k t^{2/3}$.
• The case where $|\hat{\mu}_{k,t}(1) - \mu_k(1)| > \delta_{k,t}$ and $N_{k,t} \le \rho_k t^{2/3}$.

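Formally, the total regret induced by playing any of the corner points is bounded by

$$\begin{aligned} &\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\Big(\text{arm } k \text{ is selected at time step } t,\ \bigcap_{j \in [K]} |\hat{\mu}_{j,t}(1) - \mu_j(1)| \le \delta_{j,t}\Big) + \\ &\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\Big(\text{arm } k \text{ is selected at time step } t,\ |\hat{\mu}_{k,t}(1) - \mu_k(1)| > \delta_{k,t},\ N_{k,t} > \rho_k t^{2/3}\Big) + \\ &\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\Big(\text{arm } k \text{ is selected at time step } t,\ |\hat{\mu}_{k,t}(1) - \mu_k(1)| > \delta_{k,t},\ N_{k,t} \le \rho_k t^{2/3}\Big). \end{aligned} \tag{3.26}$$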


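From the preliminary analysis above, we know that the first term in equation (3.26) corresponds to the case where $N_{k,t} \le \rho_k t^{2/3}$ for all $k \in [K]$. Therefore, the regret in this case satisfies

$$\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\Big(\text{arm } k \text{ is selected at time step } t,\ \bigcap_{j \in [K]} |\hat{\mu}_{j,t}(1) - \mu_j(1)| \le \delta_{j,t}\Big) \le \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\big(N_{k,t} \le \rho_k t^{2/3}\big) \le \sum_{k=1}^{K} \rho_k T^{2/3}. \tag{3.27}$$

This reasoning also holds for the third term in equation (3.26), since $N_{k,t} \le \rho_k t^{2/3}$. For the second term, where $N_{k,t} > \rho_k t^{2/3}$, the regret is bounded by

$$\begin{aligned} &\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{I}\Big(\text{arm } k \text{ is selected at time step } t,\ |\hat{\mu}_{k,t}(1) - \mu_k(1)| > \delta_{k,t},\ N_{k,t} > \rho_k t^{2/3}\Big) \le \\ &\sum_{t=1}^{T} \sum_{k=1}^{K} P\Big[ \sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| > \delta_{k,t},\ N_{k,t} > \rho_k t^{2/3} \Big] \underbrace{\le}_{\text{lemma 3.3.1}} \\ &\sum_{t=1}^{T} \sum_{k=1}^{K} 4 \exp(-2 N_{k,t} \delta_{k,t}^2) \underbrace{=}_{\text{replace } \delta_{k,t}} \sum_{t=1}^{T} \sum_{k=1}^{K} 4 \exp(-2 \xi t^{2/3}) \le \\ &4K \exp(-2\xi) \int_0^T e^{-t^{2/3}}\,dt \le 4K \exp(-2\xi) \cdot \frac{3\sqrt{\pi}}{4} = 3K\sqrt{\pi} \exp(-2\xi), \end{aligned} \tag{3.28}$$

where we used numerical integration to evaluate the last step. We can see that this error is constant and does not depend on $T$. Combining this bound with the other two terms in equation (3.26) implies regret of order $O(T^{2/3})$.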
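The factor that turns $4K$ into $3K\sqrt{\pi}$ in the last step is consistent with the value of $\int_0^\infty e^{-t^{2/3}}\,dt$. Assuming that is the integral being evaluated, the substitution $u = t^{2/3}$ gives $\tfrac{3}{2}\Gamma(\tfrac{3}{2}) = \tfrac{3\sqrt{\pi}}{4}$, which the sketch below compares against a crude midpoint-rule integration. The function names are hypothetical.

```python
import math


def closed_form():
    """(3/2) * Gamma(3/2), obtained from the substitution u = t**(2/3)."""
    return 1.5 * math.gamma(1.5)


def midpoint_integral(upper=200.0, steps=200_000):
    """Midpoint-rule approximation of the integral of exp(-t**(2/3)) on [0, upper]."""
    h = upper / steps
    return h * sum(math.exp(-((i + 0.5) * h) ** (2.0 / 3.0)) for i in range(steps))
```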


Regret Analysis for the Oracle Estimator 

Now we consider the error resulting from sampling the point yielded by the estimator as the optimal point. We assume that the estimator is optimized using a perfect oracle. At time step $t$, the error from playing the extra arm is given by

$$\left| \sum_{k=1}^{K} \hat{\mu}_{k,t}(\hat{x}_{k,t}) - \sum_{k=1}^{K} \mu_k(x_k^*) \right|. \tag{3.29}$$


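Assume that, at time step $t$, for all $k \in [K]$, we have $\sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| \le \eta_{k,t}$. Both $\hat{\mu}_{k,t}(x)$ and $\mu_k(x)$ are increasing functions of $x$, so we can bound the error by

$$\begin{aligned} \left| \sum_{k=1}^{K} \hat{\mu}_{k,t}(\hat{x}_{k,t}) - \sum_{k=1}^{K} \mu_k(x_k^*) \right| &\le \sum_{k=1}^{K} |\hat{\mu}_{k,t}(\hat{x}_{k,t}) - \mu_k(x_k^*)| \le \\ \sum_{k=1}^{K} |\hat{\mu}_{k,t}(x_k^*) - \mu_k(x_k^*)| &\le \sum_{k=1}^{K} |\hat{\mu}_{k,t}(1) - \mu_k(1)| \le \\ \sum_{k=1}^{K} \sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| &\le \sum_{k=1}^{K} \eta_{k,t}. \end{aligned} \tag{3.30}$$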


Once again, since this bound is only valid on specific time steps, we decompose the regret into three cases:

• The case where $\sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| \le \eta_{k,t}$ for all $k \in [K]$.
• The case where $\sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| > \eta_{k,t}$ and $N_{k,t} > \rho_k t^{2/3}$.
• The case where $\sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| > \eta_{k,t}$ and $N_{k,t} \le \rho_k t^{2/3}$.

In the first case, the bound at time step $t$ is given by (3.30). If we set $\eta_{k,t} = \lambda_k t^{-1/3} \sqrt{\log t}$, then by summing over all time steps, $t = 1, \ldots, T$, we have

$$\sum_{t=1}^{T} \sum_{k=1}^{K} \eta_{k,t} = O\left( T^{2/3} \sqrt{\log T} \right). \tag{3.31}$$

For the third case, using the same reasoning as when analyzing the regret for a corner point, the bound is given by

$$\sum_{k=1}^{K} \rho_k T^{2/3} = O(T^{2/3}). \tag{3.32}$$


In the second case, suppose we choose $\eta_{k,t} = \lambda_k t^{-1/3} \sqrt{\log t}$. Then the bound is

$$\sum_{t=1}^{T} \sum_{k=1}^{K} P\Big[ \sup_{x \in [0,1]} |\hat{\mu}_{k,t}(x) - \mu_k(x)| > \eta_{k,t},\ N_{k,t} > \rho_k t^{2/3} \Big] \le \sum_{t=1}^{T} \sum_{k=1}^{K} 4 \exp\left( -2 \rho_k t^{2/3} \lambda_k^2 t^{-2/3} \log t \right) = 4 \sum_{t=1}^{T} \sum_{k=1}^{K} t^{-2 \rho_k \lambda_k^2}. \tag{3.33}$$

Setting $\lambda_k$ such that $2 \rho_k \lambda_k^2 = 1/3$ results in a bound of the form $4 \sum_{t=1}^{T} \sum_{k=1}^{K} t^{-1/3} = O(T^{2/3})$.

We conclude that the oracle error is of order $O(T^{2/3} \sqrt{\log T}) + O(T^{2/3}) + O(T^{2/3}) = O(T^{2/3} \sqrt{\log T})$. Combining this result with our analysis of regret for a corner point, we end up with the result stated in lemma 3.4.1.


Regret Analysis - Beyond the Oracle Case 

As mentioned above, our argument relies on the maximizer of the estimator having zero error due to the existence of an oracle. In practice, if $K = 2$, then an exhaustive search of $\Delta^K$ (which reduces to the segment $[0, 1]$ in this case) is plausible, leading to the following result:

Lemma 3.4.2 For $K = 2$, the regret $r_T(\pi_{(K+1)\text{-UCB}})$ resulting from Algorithm 4 using exhaustive search in the estimator optimization step is of order $O(T^{2/3} \sqrt{\log T})$.

For $K > 2$, however, the exhaustive search approach is not feasible, and we instead employ the FKM algorithm to maximize our estimator. Note that for $K = 2$, an optimal method, proposed by Kleinberg in [10], yields better performance than the (K + 1)-UCB algorithm using FKM.

Hazan [7] showed that the regret performance of the FKM algorithm is of order $O(T^{3/4})$, which immediately results in the following adaptation of our result to the case where we use FKM:

Lemma 3.4.3 The regret $r_T(\pi_{(K+1)\text{-UCB}})$ resulting from Algorithm 4 using FKM in the estimator optimization step is of order $O(T^{3/4})$.


3.4.4 Tuning (K + 1)-UCB Parameter $\xi$

The parameter $\xi$, appearing in the expression for $g_k(t)$, controls the rate of decay of the UCB indices $\mathrm{UCB}_{k,t}$ in Algorithm 4. Our analysis above for the oracle estimator shows that the error is of $O(T^{2/3})$. As discussed above, we have two error components: a "misplay error," resulting from sampling a corner point of $\Delta^K$ instead of sampling the point determined by the oracle estimator; and a "probabilistic error," occurring when the conditions for lemma 3.3.1 do not hold.

This expected error, denoted $E(T, K)$, is a function of $K$ and is bounded by:

$$E(T, K) \le \underbrace{3K\sqrt{\pi} \exp(-2\xi)}_{\text{probabilistic error, equation (3.28)}} + \underbrace{\sum_{k=1}^{K} \frac{4 \xi T^{2/3}}{\left( \epsilon_k - \frac{1}{K+2} \sum_{j=1}^{K} \epsilon_j \right)^2}}_{\text{misplay error}}. \tag{3.34}$$


Note that the misplay error can be also written as \“ k=l ( po where H,, 24 
k~ Ke a 1 © 


a generalized harmonic number [15], and is of order O(log We can minimize by the 


bound by selecting € to be 


(3.35) 


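Since the $\epsilon_k$'s are constant with respect to $K$, then $\frac{1}{K+2} \sum_{k=1}^{K} \epsilon_k = O(1)$, and therefore we have $\sum_{k=1}^{K} \left( \epsilon_k - \frac{1}{K+2} \sum_{j=1}^{K} \epsilon_j \right)^{-2} = O(K)$. Consequently, we claim $\xi_{\text{opt}} = O(\log K)$.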


3.4.5 Discussing (K + 1)-UCB Versus Kleinberg’s Algorithms 

Both of Kleinberg's algorithms (Table 2.1) are based on nearly-optimal sampling of the decision space in order to identify potential regions where the optimal solution lies. These algorithms do not exploit the gradient (or a gradient estimator) and base their decisions on confidence regions, similarly to UCB.

In our problem, this sampling approach becomes sub-optimal due to the censoring mechanism incorporated into our model. Consider, for instance, an arbitrary decision point $x \in \Delta^K$, whose $k$-th coordinate is given by $x_k$. The probability that any of the sampled sources is censored is $1 - \prod_{k=1}^{K} \left[ P(X_k \le x_k) \right]^{\mathbb{I}(x_k > 0)}$. Note that if a specific coordinate is 0, we expect no value from the source corresponding to that coordinate.

Kleinberg's strategy for $K = 2$ is to create a uniform grid over the $[0,1]$ segment, $\{1/m, 2/m, \ldots, (m-1)/m, 1\}$, where $m$ is some factor dependent on $T$ [10]. The algorithm focuses its search on the best candidate (using MAB) by continuously increasing $m$ and making the grid finer. If we sample the grid at $j/m$ for some $j$, the probability that the total sampled value is censored becomes $1 - P(X_1 \le j/m) \cdot P(X_2 \le 1 - j/m)$. The expression suggests that if the algorithm samples points near but not at the edges of $\Delta^2$ (that is, $j$ is either very small or very large), the probability of losing information increases. As such, Kleinberg's algorithm has a significant disadvantage in settings where the optimal solution resides near the edges of $\Delta^2$.
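To make the edge effect concrete, the sketch below evaluates $1 - P(X_1 \le j/m) \cdot P(X_2 \le 1 - j/m)$ on the interior grid points for $K = 2$. The sources are assumed to be Uniform[0, 1] purely for illustration, since that makes the CDF trivial; any pair of CDF callables can be substituted, and the function names are hypothetical.

```python
def uniform_cdf(x):
    """CDF of Uniform[0, 1], used here only as an illustrative assumption."""
    return min(max(x, 0.0), 1.0)


def grid_censor_probabilities(cdf1, cdf2, m):
    """Pairs (j/m, probability that at least one of the two sources is censored)."""
    return [(j / m, 1.0 - cdf1(j / m) * cdf2(1.0 - j / m)) for j in range(1, m)]
```

With uniform sources and $m = 10$, the probability is $0.75$ at the midpoint and about $0.91$ at $j = 1$ or $j = 9$, matching the observation that grid points close to the edges lose information more often.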


In the (K + 1)-UCB case, however, the probability of losing information due to censoring becomes 0. The trade-off lies in the fact that for larger values of $K$, (K + 1)-UCB must spend many iterations sampling each individual source, whereas Kleinberg's methods sample the decision space at less stable but potentially more "interesting" points.


Zooming Algorithm Performance 

The zooming algorithm operates in phases $i = 1, 2, \ldots$, each requiring $2^i$ time steps. If there are $p \in \mathbb{N}$ phases, the total number of time steps required is $2^{p+1} - 2$. In this section, we assume that $T = 2^{p+1} - 2$ for some $p \in \mathbb{N}$, and without loss of generality, assume the first center used by the zooming algorithm is $e_1 \in \Delta^K$.

Let $n_j(t)$ be the number of plays of the arm corresponding to the ball centered at $e_j$ at time step $t$. Appendix A.2 shows that if $n_j(t) < 4i - 1$, then the ball covers the entirety of $\Delta^K$, and that the zooming algorithm uncovers $K$ balls, whose centers are the vectors $\{e_j\}_{j=1}^{K}$. Without loss of generality, assume the balls are added to the collection in the order $e_1, e_2, \ldots, e_K$. The key observation is that while at least one of the balls in the collection covers the entirety of $\Delta^K$, we will only ever sample the objective function at one of the points $\{e_j\}_{j=1}^{K}$.

3.5 Addressing the K = 2 Case
In this section, we focus on the case where $K = 2$. While the (K + 1)-UCB algorithm is still applicable in this case, we focus our attention on the twin-cost policy estimator introduced in subsection 3.3.2.

Recall that we define the twin-cost policy estimator using a parameter $\beta \in [0,1]$. This parameter defines the 2-dimensional points $(\beta, 1 - \beta)^\top$ and $(1 - \beta, \beta)^\top$ that we can use in exploration steps. Each such 2-dimensional point defines an "arm", in a similar manner to how the corner points of $\Delta^K$ defined "arms" for the (K + 1)-UCB algorithm. Furthermore, we use a "+1-arm" that is based on the twin-cost policy estimator (3.12). We dub this algorithm Mixed-(2 + 1)-UCB, which is detailed in Algorithm 5. Note that the "+1-arm" is indexed as (2 + 1) to highlight its difference from arms 1 and 2.


The Mixed-(2 + 1)-UCB algorithm can be viewed as an extension of Algorithm 4. Setting the parameter $\beta$ to either 0 or 1 recovers the (K + 1)-UCB algorithm for the $K = 2$ case exactly. We note that this algorithm may not be as appealing as the (K + 1)-UCB algorithm due to its sampling strategy. While the algorithm uses a more complex estimator (3.12) that is potentially more accurate, it is more prone to censoring effects, especially as $\beta$ approaches the midpoint $1/2$. Also, the exact location of the optimal solution within the 2-dimensional simplex may affect the performance of this algorithm.

Algorithm 5: Mixed-(2 + 1)-UCB
Input: $T \in \mathbb{N}$ (time horizon)
       $B = 1$ (budget)
       $\xi \in \mathbb{R}$
       $\beta \in [0, 1]$ (parameter)
Output: $\pi_{\text{Mixed-}(2+1)\text{-UCB}} = \{x_t\}_{t=1}^{T}$ (Mixed-(2 + 1)-UCB policy)
$\forall k = 1, 2$: $\mathrm{UCB}_{k,0} \leftarrow \infty$, $N_{k,0} \leftarrow 0$, $\hat{\mu}_{k,0} \leftarrow 0$
$\mathrm{UCB}_{(2+1),0} \leftarrow 0$
for $t = 1, 2, \ldots, T$ do
    $k_t \leftarrow \arg\max \{\mathrm{UCB}_{1,t-1}, \mathrm{UCB}_{2,t-1}, \mathrm{UCB}_{(2+1),t-1}\}$
    $x_t \leftarrow (\beta, 1 - \beta)^\top$ if $k_t = 1$; $x_t \leftarrow (1 - \beta, \beta)^\top$ if $k_t = 2$; $x_t \leftarrow \arg\max_{x \in \Delta^2} F_t(x)$ if $k_t = (2 + 1)$
    $y_t \leftarrow \sum_{k=1}^{2} X_{k,t}\, \mathbb{I}(X_{k,t} \le x_{k,t})$ (play arm $k_t$)
    if $k_t \ne (2 + 1)$ then
        $N_{k_t,t} \leftarrow N_{k_t,t-1} + 1$
        $\hat{\mu}_{k_t,t} \leftarrow (N_{k_t,t-1}\, \hat{\mu}_{k_t,t-1} + y_t) / N_{k_t,t}$
        $\mathrm{UCB}_{k_t,t} \leftarrow \hat{\mu}_{k_t,t} + \sqrt{\xi t^{2/3} / N_{k_t,t}}$ (update UCB index according to the index type)
    end
    $\mathrm{UCB}_{(2+1),t} \leftarrow \max_{x \in \Delta^2} F_t(x)$ (optimize the estimator)
end











3.6 Limitations and Caveats 

In this section, we discuss several limitations and caveats of our proposed model and 
algorithm. We first address the assumptions presented in the model and propose some 
potential extensions to explore, and then proceed to discuss the limitations of our proposed 


algorithm (K + 1)-UCB from a practical perspective. 


3.6.1 Model Limitations 
In section 3.1, we gradually designed a mathematical model, including its associated assumptions, for the purpose of intelligence collection. A key assumption in our analysis is that each intelligence source contains information that can be evaluated as a real number in the range $[0,1]$. While this assumption raises an organizational issue, requiring a consistent evaluation process for every piece of information collected, it is nonetheless crucial for an organization to address should it choose to employ our model and algorithm.

Another important assumption, albeit not explicitly stated within the model, is that intelligence sources cannot change significantly between collection attempts. Each time step $t \in [T]$ should represent an isolated period of time in which the decision maker can direct collection attempts and also observe the value of information collected. If extraction of information takes too long, the source of information might change or become irrelevant.


3.6.2 (K + 1)-UCB Algorithm Limitations 

In subsection 3.4.3, we showed that Algorithm 4, dubbed (K + 1)-UCB, provides satisfactory results in terms of regret. Since the algorithm samples the corner points of $\Delta^K$, it is able to collect uncensored information and construct a better estimator for the objective function. This property of the algorithm also presents a limitation when the number of intelligence sources $K$ increases: since the algorithm has to spend multiple time steps to sample each source individually, if $K \approx T$, then (K + 1)-UCB will only be exploring, not exploiting.

Another potential caveat in our design of the (K + 1)-UCB algorithm is that the algorithm treats the set of intelligence sources as a single objective function. In practice, since we sample each intelligence source individually, the decision maker has a set of $K$ values to evaluate, some of which are uncensored. As a result, a decision maker noting that a source $k \in [K]$ is censored when paying a cost $x_k$ can avoid paying the same or a lower cost on that source in the future. This proposed decomposition of the objective function into its $K$ components may be key to producing better results, while potentially sacrificing the ability to rigorously analyze any proposed algorithms. In this work, we made it our priority to address models and algorithms that can be accompanied by a sound mathematical analysis of their performance.

CHAPTER 4:

Numerical Experiments and Results





In this chapter, we study our results from Chapter 3 in various scenarios. Our experiments include both the estimation of the objective function in a censored environment (as detailed in section 3.3), as well as empirical evaluation of algorithm implementations.

Since we assume the distribution of the value of information of all sources is on $[0, 1]$, we will typically address the following three types of intelligence sources' distributions:

• A triangular distribution, denoted by $\mathrm{Tri}(a, b, c)$ for $a, b, c \in [0,1]$ and $a < c < b$, whose mean is $\frac{a + b + c}{3}$.
• A beta distribution, denoted by $\mathrm{Beta}(\alpha, \beta)$, with $\alpha, \beta \in [0, 1]$, whose mean is $\frac{\alpha}{\alpha + \beta}$.
• A combination of the above distributions, with some sources having a triangular distribution and others having a beta distribution.


4.1 Evaluation of Censored Mean Estimators 

We first explore the censored environment and how to estimate the expected observable value, as discussed in section 3.3. The expected observable value, $\mu(x)$, depends on the underlying distribution of the random variables representing the intelligence sources. Figure 4.1 shows, for example, that this function can be either convex or concave, depending on the specific distribution. This point is crucial, because the OCO framework from Chapter 2 could be employed under the assumption of concavity, an assumption we do not make.


4.1.1 Visualizing the Full-cost Policy Estimator 

In this subsection, we create various estimates of the censored mean on the interval $[0,1]$ with varying numbers of samples. For each sample size, we simulate the probability that the largest error between the estimate and the true censored mean surpasses some threshold $\delta$. The simulation results are based on 1000 replications for each sample size. In turn, this allows us to approximate the following expression from lemma 3.3.1,

$$P\left( \sup_{x \in [0,1]} |\hat{\mu}(x; n) - \mu(x)| > \delta \right).$$

(a) Censored Mean of Beta(0.5, 0.5) (b) Censored Mean of Beta(0.7, 0.2) (c) Censored Mean of Tri(0, 0.8, 0.25)

Figure 4.1. Censored mean examples for different distributions and parameters. Observe that the censored mean function can be convex or concave, and its rate of increase depends on the specific distribution and its parameters.

Figure 4.2. Empirical probability for a full-cost policy estimator error to exceed threshold $\delta = 0.1$ and its associated theoretical bound. Observe the theoretical bound is indeed an upper bound for our simulation, and that the simulated probability has the expected shape of a decaying exponential. All results are based on $10^3$ replications of simulated estimation.

Figure 4.2 shows the results of this experiment, as well as the associated theoretical bound from lemma 3.3.1. While the bound is not tight, we see it is effective for our analysis. Also, note that the simulated probability decays exponentially with the number of samples, regardless of the actual distribution.
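A replication study of this kind can be sketched as follows. The snippet is illustrative only and is not the code behind Figure 4.2: it assumes a Uniform[0, 1] source, chosen because its censored mean has the closed form $\mu(x) = x^2/2$, and it assumes the empirical censored mean is the average of $X_i \cdot \mathbb{I}(X_i \le x)$. Function names are hypothetical.

```python
import math
import random


def sup_error_uniform(n, rng):
    """Largest gap between the empirical and true censored mean on [0, 1].

    The empirical function is a step function that jumps at each sample,
    while the true one is increasing, so the gap peaks at a jump point
    (just before or at the jump) or at x = 1.
    """
    xs = sorted(rng.random() for _ in range(n))
    worst, running = 0.0, 0.0
    for s in xs:
        true_value = s * s / 2.0
        worst = max(worst, abs(running / n - true_value))  # just before the jump
        running += s
        worst = max(worst, abs(running / n - true_value))  # at the jump
    return max(worst, abs(running / n - 0.5))              # at x = 1


def exceed_probability(n, delta, replications, seed=0):
    """Fraction of replications whose sup error exceeds delta."""
    rng = random.Random(seed)
    hits = sum(sup_error_uniform(n, rng) > delta for _ in range(replications))
    return hits / replications


def theoretical_bound(n, delta):
    """Bound from lemma 3.3.1: 4 * exp(-2 * n * delta^2)."""
    return 4.0 * math.exp(-2.0 * n * delta ** 2)
```

Sweeping `n` over a range of sample sizes and plotting `exceed_probability` against `theoretical_bound` yields a comparison of the kind the figure describes.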


4.1.2 Visualizing the Twin-cost Policy Estimator 

Here we demonstrate the strengths and weaknesses of the twin-cost policy estimator discussed in subsection 3.3.2. To do so, we observe the distribution of $K = 2$ intelligence sources and evaluate the objective function from equation (3.7), instead of noting each source's individual censored mean.

(a) Beta(0.5, 0.5), Beta(0.7, 0.2) (b) Tri(0, 0.8, 0.25), Tri(0, 1, 0.25)

Figure 4.3. For the above distribution configurations, the plots show the max error per number of samples per arm, using the full-cost and twin-cost estimators. The maximal error is taken in the range $[0.2, 0.8]$ ($\beta = 0.2$). Observe that for smaller sample sizes per arm, the twin-cost policy estimator is more accurate.

The twin-cost policy estimator depends on a parameter $\beta \in [0,1]$. Our intention was to design an estimator that is capable of operating with fewer samples, focusing on the range $[\beta, 1 - \beta]$. When evaluating the objective function at a point $x \in [0,1]$ with $x < \beta$ or $x > 1 - \beta$, we expect the results to not be as good as in the full-cost policy case.

Figure 4.3 shows two distribution configurations for the Beta and triangular distributions, and their respective maximal errors for each estimator. The maximal error is taken in the range $[0.2, 0.8]$ (that is, $\beta = 0.2$). For each sample size, we simulate both the full-cost policy estimator and the twin-cost policy estimator and compare the maximal error in the range $[\beta, 1 - \beta]$. The figure shows that for smaller sample sizes, the twin-cost policy estimator is more accurate.

Figure 4.4. For $K = 2$ sources, with distributions Tri(0, 0.8, 0.25) and Tri(0, 1, 0.25), respectively, we show the maximal error of the full-cost policy estimator and the twin-cost policy estimator per number of samples per arm. Each subfigure shows a different $\beta$ value used for the twin-cost estimator. The max error is taken in the range $[\beta, 1 - \beta]$ only.

As mentioned above, the performance of the twin-cost estimator depends on the specific distributions and the exact location of the optimal solution in $\Delta^K$. If the optimal solution is either of the corner points $(1, 0)$ or $(0, 1)$, the twin-cost policy will not be able to collect samples that capture the optimal solution.


We demonstrate this observation in Figure 4.5. Here, we use the distributions Beta(0.2, 0.4) and Beta(0.3, 0.6) and $\beta = 0.2$. Unlike the previous figures, here we focus on the maximal error in the region where $x < \beta$ or $x > 1 - \beta$, that is, beyond the region where the twin-cost policy estimator is expected to prove useful. We can see that the twin-cost policy estimator shows an almost constant error as the number of samples per arm increases, whereas the full-cost policy estimator shows a decline in the maximal error. As a consequence, the regret grows linearly under this configuration.

Figure 4.5. For $K = 2$ sources, with distributions Beta(0.2, 0.4) and Beta(0.3, 0.6), respectively, we show the maximal error of the full-cost policy estimator and the twin-cost policy estimator per number of samples per arm, with $\beta = 0.2$. The max error is taken in the range where $x < \beta$ or $x > 1 - \beta$.

4.2 Algorithm Evaluation and Comparison

In this section, we evaluate the performance of the (K + 1)-UCB algorithm (Algorithm 4), and
compare it to the performance of two other continuum MAB algorithms, both by Kleinberg
[10], [11]. The original algorithms by Kleinberg et al. did not account for censoring of the
samples. We therefore add censoring to our implementation of these algorithms,
following the same procedure as described in Chapter 3. Note that
the algorithm detailed in [10] is suitable only in the case where K = 2.


Throughout our experiments, we use simulation over a discrete grid in Δ^K to uncover the
optimal solution of equation (3.7). This step is required in order to evaluate and plot the
regret of an algorithm. The plots shown in this section are based on the averaged simulated
regret resulting from running each algorithm 100 times and comparing the accumulated
rewards (value of information collected) to the simulated optimal value.
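
The grid search described above can be sketched as follows. This is a minimal illustration rather than the implementation used for the experiments: it assumes K = 2, a budget of 1, and that the objective is the sum over sources of μ(x) = E[X · 1{X ≤ x}], matching the form of μ used in Appendix A.1. The function names, the grid resolution, and the Monte Carlo sample size are hypothetical. For the All-beta configuration of Table 4.1, the mean of Beta(0.7, 0.6) is 0.7/1.3 ≈ 0.538, which agrees with the tabulated optimal value at the corner (0, 1).

```python
import random

def censored_mean(samples, effort):
    """Monte Carlo estimate of mu(x) = E[X * 1{X <= x}] for effort x."""
    return sum(v for v in samples if v <= effort) / len(samples)

def grid_optimum(samples_per_source, steps=20):
    """Search allocations (x, 1 - x) on a uniform grid of the 2-simplex."""
    best_value, best_point = float("-inf"), None
    for i in range(steps + 1):
        x = i / steps
        point = (x, 1.0 - x)
        # Assumed objective: sum of censored means, one term per source.
        value = sum(
            censored_mean(samples, effort)
            for samples, effort in zip(samples_per_source, point)
        )
        if value > best_value:
            best_value, best_point = value, point
    return best_point, best_value

rng = random.Random(0)
n = 20000  # hypothetical Monte Carlo sample size per source
# All-beta configuration as listed in Table 4.1.
sources = [
    [rng.betavariate(0.2, 0.4) for _ in range(n)],
    [rng.betavariate(0.7, 0.6) for _ in range(n)],
]
best_point, best_value = grid_optimum(sources)
print(best_point, round(best_value, 3))
```

A finer grid or a larger sample size trades run time for accuracy; for K > 2 the number of grid points grows quickly, which is the practical limit on K noted later in this chapter.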


In this section, we often discuss the performance of implemented algorithms. Unless noted
otherwise, by performance we refer to the mean regret of the algorithm in a particular
setting, as detailed in Chapters 2 and 3.
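
As a concrete reading of this definition, the sketch below averages the regret, taken as t times the optimal value minus the accumulated reward, over independent replications. It is an illustration under stated assumptions: a sample is observed (and rewarded) only when its value does not exceed the effort allocated to its source, the policy shown is a fixed allocation rather than any of the algorithms compared in this chapter, and all function names are hypothetical.

```python
import random

def run_fixed_allocation(allocation, draw_fns, horizon, rng):
    """Play one fixed allocation; return the cumulative reward after each step."""
    total, cumulative = 0.0, []
    for _ in range(horizon):
        for effort, draw in zip(allocation, draw_fns):
            value = draw(rng)
            if value <= effort:  # effort met the hidden threshold: observed
                total += value
        cumulative.append(total)
    return cumulative

def mean_regret(optimal_value, reward_curves):
    """Average over replications of t * optimal_value - cumulative reward at t."""
    horizon = len(reward_curves[0])
    reps = len(reward_curves)
    return [
        sum((t + 1) * optimal_value - curve[t] for curve in reward_curves) / reps
        for t in range(horizon)
    ]

rng = random.Random(0)
# All-beta sources as listed in Table 4.1.
draw_fns = [
    lambda r: r.betavariate(0.2, 0.4),
    lambda r: r.betavariate(0.7, 0.6),
]
# Deliberately suboptimal corner (1, 0); the optimum is assumed to be (0, 1).
curves = [run_fixed_allocation((1.0, 0.0), draw_fns, 200, rng) for _ in range(50)]
regret = mean_regret(0.7 / 1.3, curves)  # 0.7 / 1.3 is the mean of Beta(0.7, 0.6)
print(round(regret[-1], 1))
```

Because the allocation never changes, the mean regret in this example grows linearly in t, at a rate equal to the gap between the two corner values.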


4.2.1 The K = 2 Case 

We first address the K = 2 case where exactly 2 intelligence sources are present. Here, 
we explore both the (K + 1)-UCB algorithm (Algorithm 4) and the Mixed-(2 + 1)-UCB 
algorithm (Algorithm 5). 


We use different configurations that determine the distribution of both sources, as detailed 
in Table 4.1. One configuration, referred to as All-beta, uses information value distributions 
of Beta(0.2, 0.4) and Beta(0.7, 0.6), and one configuration, referred to as All-triangular,
uses information value distributions of Tri(0, 0.7, 0.2) and Tri(0, 0.8, 0.3).
We also explore a mixture of distributions, referred to as Mixed-1-And-1,
with one source having a Beta distribution, and one source having a triangular distribution.

Table 4.1. Experiment Configuration for K = 2

Distribution    Source 1          Source 2          Optimal Solution  Optimal Value
All-beta        Beta(0.2, 0.4)    Beta(0.7, 0.6)    (0, 1)            0.538
All-triangular  Tri(0, 0.7, 0.2)  Tri(0, 0.8, 0.3)  (0.4, 0.6)        0.5
Mixed-1-And-1   Tri(0, 0.6, 0.4)  Beta(0.4, 0.6)    (0.525, 0.475)    0.434











Mixed-(2 + 1)-UCB Parameter β Tuning
Figures 4.6 and 4.7 depict the performance of the Mixed-(2 + 1)-UCB algorithm (Algorithm 5) with parameter
values β ∈ {0.2, 0.3, 0.4} for the All-beta and All-triangular configurations, respectively.


We can see that the performance of the algorithm per choice of β depends on the particular
distribution. For the All-beta configuration, whose optimal solution uncovered through
simulation is one of the corner points of Δ^2, the performance of the algorithm for all
values of β is similar, with β = 0.2 yielding the best results. Note that the corner points
(0, 1) and (1, 0) cannot be captured by the twin-cost policy estimator, which explains the
relatively similar performance for all three parameter values.


In the All-triangular configuration, however, significant differences in performance are
observed, with β = 0.4 producing the best results. Through simulation, we know that the
optimal solution in this case is (0.4, 0.6) ∈ Δ^2. As such, we can expect that the choice
β = 0.4 will produce the best results.


We also note that both Figures 4.6 and 4.7 show linearly growing regret. As we use the
twin-cost policy estimator, this behavior is expected.

Figure 4.6. Algorithm performance depends on the specific choice of β.
For the All-beta configuration, β = 0.2 yields the best performance. Note
the linear growth of the regret.

Figure 4.7. Algorithm performance depends on the specific choice of β.
For the All-triangular configuration, β = 0.4 yields the best performance.
Note the linear growth of the regret.

(K + 1)-UCB Parameter ξ Tuning

In Chapter 3, when presenting the (K + 1)-UCB algorithm, we discussed the upper bound
on the expected error. We also derived an expression for an optimal value of the parameter
ξ, for which the upper bound is minimized. Figure 4.8 shows a demonstration of this
analysis for the All-beta configuration, with the curve corresponding to the derived constant yielding the
best performance in terms of mean regret. We remind the reader that the derivation of this
constant is not applicable in practice, as it requires knowledge of the distribution of each
intelligence source. As expected, the figure shows sub-linear growth of regret that is of 
order of approximately O(T^{2/3} √(log T)).

Figure 4.8. Demonstration of the performance of (K + 1)-UCB for different
values of ξ. The best performing curve corresponds to the optimal value of
ξ, derived in Chapter 3. Note that the regret is approximately O(T^{2/3} √(log T)).


Algorithm Performance
In this subsection, we show the performance of different algorithms in the above settings.
Our results are based on simulation of each algorithm across 100 replications for each
setting. We separately simulate each scenario on a discretized grid approximating Δ^2 to
locate the optimal solution against which the regret is computed. Note that for the (K + 1)-UCB
algorithm, we explore two variants: one that uses FKM [7] to optimize the estimator,
and another that employs an Oracle to optimize the estimator.


Figure 4.9 shows the performance of (K + 1)-UCB with 100 steps of FKM gradient-like
ascent [7] versus Kleinberg's optimal sampling strategy for K = 2 [10] in the All-beta
setting. It is apparent that the proposed algorithm is superior in terms of mean regret,
with Kleinberg's algorithm sampling often-censored points in Δ^2, resulting in a linear
regret. Moreover, the Mixed-(2 + 1)-UCB algorithm shows linear regret, whereas
Kleinberg's algorithm and both variants of (K + 1)-UCB show sub-linear regret.

Figure 4.9. Mean regret performance of the (2 + 1)-UCB algorithm, Kleinberg's
algorithm, and the Mixed-(2 + 1)-UCB algorithm in the All-beta setting (Table
4.1). Note how both variants of the (K + 1)-UCB algorithm (with FKM and
with an Oracle) outperform Kleinberg's algorithm and the Mixed-(2 + 1)-UCB
algorithm. Note the linear vs. sub-linear regrets for each algorithm.

Figure 4.10 depicts a similar comparison of the algorithms, this time in the All-triangular
setting. Here, the optimal solution lies within Δ^2 at (0.4, 0.6). We can see that in this example,
Kleinberg's algorithm for K = 2 and the Mixed-(2 + 1)-UCB algorithm outperform the two
variants of (K + 1)-UCB.


In contrast, Figure 4.11 shows another comparison where the optimal solution is not a
corner point (see Table 4.1). Here, we observe the opposite effect: both variants of
(K + 1)-UCB perform better than both Kleinberg's algorithm and
the Mixed-(2 + 1)-UCB algorithm. Note that both Figures 4.10 and 4.11 depict
sub-linear regret growth for both variants of the (K + 1)-UCB algorithm.

Figure 4.10. Mean regret performance of the two variants of the (2 + 1)-UCB
algorithm in the All-triangular setting (Table 4.1). The yellow curve corresponds
to the variant using an Oracle. Note the sub-linear regret growth for
(K + 1)-UCB in both variants.

Figure 4.11. Mean regret performance of the two variants of the (2 + 1)-UCB
algorithm in the Mixed-1-And-1 setting (Table 4.1). The yellow curve corresponds
to the variant using an Oracle. Note the sub-linear regret growth
for (K + 1)-UCB in both variants.


4.2.2 The K > 2 Case 


In this subsection, we focus our attention on comparing the (K + 1)-UCB algorithm with
the zooming algorithm [11]. With K > 2, the use of an Oracle as part of running the (K + 1)-UCB
algorithm becomes impractical, and we instead use the FKM variant discussed in
Chapter 3. We observed that the implementation of the FKM variant is quite fast, even if
we perform tens of iterations each time we invoke the optimizer.


In Chapter 3, we already discussed how the zooming algorithm operates, and why we
believe its performance is not expected to be as good as the performance of the (K + 1)-UCB
algorithm. We also note that parts of our analysis are detailed in Appendix A.2. Our
simulations initialize the zooming algorithm from a random corner of Δ^K, as this approach
was observed to be the best performing among all those tested.


Our results are based on simulation of both algorithms across 100 replications for each
setting. We separately simulate each scenario on a discretized grid approximating Δ^K
to locate the optimal solution against which the regret is computed. Consequently, we are
limited by the maximal dimension K for which we can readily compute the optimal solution,
and present here results only for K = 4. The specific distribution configurations are depicted
in Table 4.2. For each configuration, we use equation (3.35) to optimally tune the constant
ξ for the (K + 1)-UCB algorithm.











[Plot for Figure 4.12. Horizontal axis: Time Step. Curves: Euclidean Zooming regret; K+1-UCB (xi = 0.71, FKM steps = 100) regret.]

Figure 4.12. Mean regret performance of the (K + 1)-UCB algorithm versus the
zooming algorithm in the All-beta setting (Table 4.2). The black dotted
curve denotes a polynomial regression fit of the form O(T^{3/4}).

Figures 4.12, 4.13 and 4.14 depict the mean regret across simulations. The figures show the
regret resulting from running both the zooming algorithm and (K + 1)-UCB, when equipped
with the FKM algorithm [7] as the optimizer of the estimator (see Chapter 3 for further
details). We have also highlighted a fitted regression curve of order T^{3/4}, showing that the regret
of (K + 1)-UCB is in accordance with result 3.4.3.
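
A fit of this kind can be sketched as follows. The exact regression procedure behind the dotted curves is not specified here, so this is an assumption: a one-parameter least-squares fit of c · t^{3/4} to a mean regret curve, together with a log-log slope as a rough empirical estimate of the growth exponent. The function names are hypothetical, and the input below is a synthetic curve used only to exercise the code, not data from our experiments.

```python
import math

def fit_power_curve(regret, exponent=0.75):
    """Least-squares constant c in regret[t - 1] ~ c * t**exponent, t = 1..T."""
    numerator = sum(r * t ** exponent for t, r in enumerate(regret, start=1))
    denominator = sum(t ** (2 * exponent) for t in range(1, len(regret) + 1))
    return numerator / denominator

def loglog_slope(regret):
    """Slope of log(regret) versus log(t): an empirical growth exponent."""
    pairs = [
        (math.log(t), math.log(r))
        for t, r in enumerate(regret, start=1)
        if r > 0  # the logarithm is undefined for non-positive regret
    ]
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var = sum((x - mean_x) ** 2 for x, _ in pairs)
    return cov / var

# Synthetic curve with known constant and exponent (illustration only).
synthetic = [2.0 * t ** 0.75 for t in range(1, 1001)]
c_hat = fit_power_curve(synthetic)
slope = loglog_slope(synthetic)
print(round(c_hat, 6), round(slope, 6))
```

On simulated regret, a log-log slope close to 3/4 over the later time steps would be consistent with the fitted curve, while a slope close to 1 would indicate linear growth.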
Figure 4.13. Mean regret performance of the (K + 1)-UCB algorithm versus the
zooming algorithm in the All-triangular setting (Table 4.2). The black dotted
curve denotes a polynomial regression fit of the form O(T^{3/4}).

[Plot for Figure 4.14. Vertical axis: Mean Regret. Horizontal axis: Time Step. Curves: Euclidean Zooming regret; K+1-UCB (xi = 0.94, FKM steps = 100) regret.]

Figure 4.14. Mean regret performance of the (K + 1)-UCB algorithm versus the
zooming algorithm in the Mixed setting (Table 4.2). The black dotted curve
denotes a polynomial regression fit of the form O(T^{3/4}).


Effect of K on the Regret 

In order to explore the effect of K on the regret, we construct a simulation of a scenario
where adding more intelligence sources does not change the optimal solution. To do so, we
have one source whose expected value is very high, while all other K − 1 sources have a very
low expected value. Figure 4.15 shows the results of this experiment. Note how all curves
depict regret of approximate order O(T^{3/4}), but the regret becomes increasingly larger for larger
values of K due to more time steps being spent on exploration.

Figure 4.15. Mean regret performance of the (K + 1)-UCB algorithm for different
values of K. The simulation is constructed such that the optimal value is
identical in all cases. Note that for all K, the regret is of order O(T^{3/4}).


4.3 Summary 

In this chapter, we presented numerical simulation results that support the various theoretical
results and claims presented in Chapter 3. Our demonstrations used simulation of the
decision space to uncover the true optimal solution, allowing us to compute the regret
where appropriate, and observe the performance of both existing and proposed algorithms.


Importantly, our results are consistent with the ones presented in Chapter 3. Moreover, in
most of our experiments, we observed that the performance of the (K + 1)-UCB algorithm was
better than that of both Kleinberg's sampling algorithm for K = 2 [10] and the zooming
algorithm [11]. For the (K + 1)-UCB algorithm, the growth of regret shown in this chapter is
either O(T^{2/3}) for the Oracle variant, or O(T^{3/4}) for the FKM variant. Note that the Oracle
variant is not practical for higher dimensions (K > 2), while the FKM variant is relatively
easy and fast to compute, even for tens of iterations. We also observed linear growth of
regret for the Mixed-(2 + 1)-UCB algorithm.


Table 4.2. Experiment Configuration for K > 2

Distribution    Per-Arm Mean                Optimal Value  Is Solution a Corner?
All-beta        (0.333, 0.5, 0.416, 0.466)  0.5            Yes
All-triangular  (0.233, 0.433, 0.4, 0.5)    0.522          No
Mixed           (0.233, 0.433, 0.4, 0.75)   0.75           Yes

CHAPTER 5:
Conclusion





This chapter summarizes our main results. Chapter 3 presented our proposed model and
discussed its limitations, the methodology used to tackle the problem that motivates the
model, and provided analysis of our proposed algorithm, (K + 1)-UCB. In Chapter 4,
we verified our results numerically through simulation and comparison to existing
approaches.


We have shown that by choosing to sample uncensored points we are able to slowly explore
the continuous decision space, while simultaneously improving upon an exploitation point.
The exploitation point can be uncovered using an oracle or by using a censor-free online
optimization technique, such as the FKM algorithm [7]. The proposed algorithm was shown to be
useful, yielding regret of order O(T^{3/4}) using the FKM algorithm. Consequently, (K + 1)-UCB
is expected to outperform high-dimension sampling strategies such as the zooming
algorithm [11].


We also showed that the optimal confidence interval constant used by (K + 1)-UCB is of
order O(log K). Although this constant cannot be uncovered under normal circumstances
as it depends on the specific distribution, the analysis shows that (K + 1)-UCB exploits the
internal relationship between the K intelligence sources imposed by the continuous decision
space, resulting in sub-linearity in K.


An important emphasis of our work is the ability to provide rigorous analysis of the
algorithms. We acknowledge that this self-imposed limitation has led us to leave aside
certain appealing aspects of the model, such as combining our +1 approach with other
algorithms, or exploiting the distributed nature of the problem (using K inputs, instead of
treating them as a singular function). We leave those endeavors to future work.

APPENDIX: Supplementary Proofs and Notes





A.1 Proof of Lemma 3.3.1 


For any δ > 0, given n IID uncensored samples of a source:

\[
P\Big( \sup_{x \in [0,1]} |\hat{\mu}(x;n) - \mu(x)| > \delta \Big) \le 4 e^{-n\delta^2/2}. \tag{A.1}
\]


A.1.1 Proof 


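Let there be n uncensored samples of the random variable X, denoted by {x_i}_{i=1}^n.
Let F̂(x; n) be the sample cumulative distribution function of X at x ∈ [0, 1],

\[
\hat{F}(x;n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \le x). \tag{A.2}
\]

Consider some x ∈ [0, 1]. Without loss of generality, assume the samples are ordered
such that x_1 ≤ x_2 ≤ ... ≤ x_k ≤ x < x_{k+1} ≤ ... ≤ x_n. Consequently, we have F̂(x; n) = k/n.
In addition, observe that

\[
\int_0^{x_k} \hat{F}(z;n)\,dz = \frac{1}{n} \sum_{i=1}^{k} (x_i - x_{i-1})(i-1) = \frac{k}{n} x_k - \frac{1}{n} \sum_{i=1}^{k} x_i, \tag{A.3}
\]

where we set x_0 = 0. Since F̂(z; n) = k/n for z ∈ [x_k, x], it follows that the integral of
F̂(z; n) over [0, x] equals (k/n) x − (1/n) Σ_{i=1}^{k} x_i. Note that μ̂(x; n) = (1/n) Σ_{i=1}^{k} x_i,
and so we can conclude that:

\[
\hat{\mu}(x;n) = \int_0^{x} \big( \hat{F}(x;n) - \hat{F}(z;n) \big)\,dz. \tag{A.4}
\]

Similarly, using the definition in equation (3.1), we have μ(x) = ∫_0^x z f(z) dz. Using integration
by parts, it is apparent that

\[
\mu(x) = \int_0^{x} \big( F(x) - F(z) \big)\,dz, \tag{A.5}
\]

and by combining these two results, we have

\[
|\mu(x) - \hat{\mu}(x;n)| = \Big| \int_0^{x} \big( F(x) - \hat{F}(x;n) \big)\,dz - \int_0^{x} \big( F(z) - \hat{F}(z;n) \big)\,dz \Big| \le 2x \sup_{z \in [0,1]} |F(z) - \hat{F}(z;n)|. \tag{A.6}
\]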
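The above expression holds for an arbitrary x ∈ [0, 1]. We can bound the probability that the
deviation between μ̂(x; n) and μ(x) is greater than some δ > 0 by using the Dvoretzky-
Kiefer-Wolfowitz (DKW) inequality [16]. Since x ≤ 1, a deviation greater than δ requires the
sample cumulative distribution function to deviate from F by more than δ/2, and therefore

\[
P\Big( \sup_{x \in [0,1]} |\hat{\mu}(x;n) - \mu(x)| > \delta \Big) \le P\Big( \sup_{z \in [0,1]} |\hat{F}(z;n) - F(z)| > \frac{\delta}{2} \Big) \le 2 e^{-2n(\delta/2)^2} \le 4 e^{-n\delta^2/2}. \tag{A.7}
\]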
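
The relationship between the two errors used in this proof can be checked numerically. The sketch below is an illustration under stated assumptions: it takes X ~ Uniform(0, 1), for which F(x) = x and μ(x) = x²/2, and verifies on one sample that the largest error of μ̂ over a grid is at most twice the exact sup-distance between the empirical and true distribution functions. The helper names, the grid size, and the sample size are hypothetical.

```python
import random

def mu_hat(samples, x):
    """Empirical censored mean: (1/n) * sum of the samples not exceeding x."""
    return sum(v for v in samples if v <= x) / len(samples)

def ks_distance_uniform(samples):
    """Exact sup_z |F_hat(z) - F(z)| when F is the Uniform(0, 1) CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    # The supremum is attained just before or at one of the sample points.
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(ordered))

def max_mu_error_uniform(samples, grid=200):
    """Largest |mu_hat(x) - mu(x)| over a grid, with mu(x) = x**2 / 2."""
    points = [i / grid for i in range(grid + 1)]
    return max(abs(mu_hat(samples, x) - x * x / 2) for x in points)

rng = random.Random(1)
samples = [rng.random() for _ in range(500)]
mu_error = max_mu_error_uniform(samples)
cdf_error = ks_distance_uniform(samples)
print(mu_error <= 2 * cdf_error)
```

In practice the error of μ̂ tends to be well below twice the distribution-function error, so a bound obtained this way is conservative.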


A.2 Zooming Algorithm on the Simplex Space 

The work in [11] describes a general method to address the continuous space MAB through
an algorithm referred to as the zooming algorithm. In this section, we discuss the
implementation of the algorithm in the context of our model and highlight some of its properties that
are specific to our implementation.

In this section, we refer to a generalized version of the simplex space, defined via

\[
\Delta^K = \{ x \in \mathbb{R}^K \mid x \ge 0,\ \mathbf{1}^\top x = B \}, \tag{A.8}
\]

where B is the available budget (see Chapter 3).

The zooming algorithm starts with a single ball centered at c_1 ∈ Δ^K that covers the entirety of
Δ^K. As time progresses, the radius of the ball r_1(t) shrinks, while the confidence
of the estimation of the objective value at the point c_1 increases. If, at any time step t, r_1(t) is not
sufficient to cover Δ^K, the algorithm must find a new vector c_2 ∈ Δ^K such that both balls
cover Δ^K, or simply, Δ^K ⊆ B(c_1, r_1(t)) ∪ B(c_2, r_2(t)).


The radius r_j(t) depends only on the algorithm's actions and is given by r_j(t) = √(8i / (1 + n_j(t))),
where i is the phase of the algorithm and n_j(t) is the number of times the ball centered at
c_j is sampled. For further details, we refer the reader to [11].


A.3 Covering Oracle Implementation 
As part of the zooming algorithm implementation, we must implement the covering oracle.
Given a set of balls ℬ(t) = {B(c_j, r_j(t))}_{j=1}^{J}, we must be able to find a point v ∈ Δ^K that is
not covered by the union ∪_{j=1}^{J} B(c_j, r_j(t)), or alert that no such point exists.
To do so, we propose the formulation

\[
\max_{v \in \Delta^K} \sum_{j=1}^{J} \|v - c_j\|_2^2
\quad \text{subject to} \quad
\forall j = 1, 2, \ldots, J:\ \|v - c_j\|_2^2 \ge r_j^2(t). \tag{A.9}
\]


This formulation is highly non-convex, even when J = 1. However, recall that we are addressing
this formulation incrementally, starting with a single ball (J = 1).


A.3.1 The J = 1 Case 

If J = 1, then we have a single ball to cover Δ^K. Note that we have ||v − c_1||_2^2 ≤ 2B^2, where
we used the fact that within the simplex Δ^K, the ℓ_1-norm is bounded by B. Consequently,
if r_1^2(t) > 2B^2 (or, equivalently, n_1(t) < 4i/B^2 − 1), the above formulation has no solution.
This observation allows us to skip addressing the optimization when these conditions hold.

The choice of c_1 is also critical, because the bound on ||v − c_1||_2^2 can be tighter than 2B^2
for specific choices of c_1. Our implementation randomly chooses c_1 from
among {B · e_k}_{k=1}^{K}. As such, the zooming algorithm is evaluated on equal grounds as the
(K + 1)-UCB algorithm, since B · e_k guarantees an uncensored sample.


Since c_1 = B · e_k, the formulation is reduced to

\[
\max_{v \in \mathbb{R}^K} \|v - B e_k\|_2^2
\quad \text{subject to} \quad
\|v - B e_k\|_2^2 \ge r_1^2(t),\quad \mathbf{1}^\top v = B,\quad v \ge 0. \tag{A.10}
\]

While this formulation is still not convex, observe that for any i ≠ k, ||B · e_i − B · e_k||_2^2 = 2B^2,
and also B · e_i ∈ Δ^K. Therefore, B · e_i (i ≠ k) is an optimal solution as long as r_1^2(t) ≤ 2B^2.


A.3.2 Beyond the J = 1 Case

Using the above analysis, we can make the following observation: the first K centers will
be the B-scaled standard basis of the K-dimensional Euclidean space.

We can show this by induction. Assume the first J < K centers are all B-scaled standard basis
vectors {B · e_j}_{j=1}^{J}. Let t_J be the time step where the union of balls ∪_{j=1}^{J} B(B · e_j, r_j(t_J))
does not cover Δ^K, and we must therefore find a new center c_{J+1} using the formulation in
equation (A.9).

Let B · e_k be a B-scaled basis vector of the K-dimensional Euclidean space, such that
e_k ≠ e_j for j = 1, ..., J. Note that Σ_{j=1}^{J} ||v − B · e_j||_2^2 ≤ 2JB^2 for any v ∈ Δ^K, and for the choice
v = B · e_k, we have exactly Σ_{j=1}^{J} ||B · e_k − B · e_j||_2^2 = 2JB^2. As such, B · e_k is the solution of
equation (A.9), if it is a feasible solution.

Immediately we have B · e_k ∈ Δ^K, so we must only confirm that ||B · e_k − B · e_j||_2^2 ≥ r_j^2(t_J)
for every j = 1, ..., J. Recall that if r_j^2(t) > 2B^2 the formulation has no solution, i.e., the simplex is covered by
the union of balls. Since we assumed the simplex is not completely covered, then r_j^2(t_J) ≤
2B^2 = ||B · e_k − B · e_j||_2^2, and thus, B · e_k is indeed a feasible solution.
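
The two facts used in the induction can be checked numerically. The sketch below is illustrative only, with hypothetical values K = 4, B = 2 and J = 2 and hypothetical helper names: it confirms that an unused B-scaled basis vector attains the objective value 2JB^2, and that randomly drawn points of the simplex do not exceed it.

```python
import random

def scaled_basis(k, dim, budget):
    """B-scaled standard basis vector B * e_k (zero-based index k)."""
    return [budget if i == k else 0.0 for i in range(dim)]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def random_simplex_point(dim, budget, rng):
    """Random point with non-negative entries summing to the budget."""
    weights = [rng.expovariate(1.0) for _ in range(dim)]
    total = sum(weights)
    return [budget * w / total for w in weights]

def objective(v, centers):
    """Sum of squared distances from v to every ball center."""
    return sum(sq_dist(v, c) for c in centers)

K, B, J = 4, 2.0, 2  # hypothetical dimension, budget, and number of centers
centers = [scaled_basis(j, K, B) for j in range(J)]
candidate = scaled_basis(J, K, B)  # a basis vector not yet used as a center
vertex_value = objective(candidate, centers)

rng = random.Random(2)
best_random = max(
    objective(random_simplex_point(K, B, rng), centers) for _ in range(5000)
)
print(vertex_value, best_random <= vertex_value)
```

The random search never improves on the unused vertex, which matches the claim that the next center chosen by the formulation is again a B-scaled basis vector.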


This analysis is especially interesting in the context of our comparison to Algorithm
4, where B = 1. While Algorithm 4 is specifically designed to sample the simplex space
Δ^K at standard basis points of the K-dimensional Euclidean space, the zooming algorithm
uncovers these sampling points on its own. We note that the zooming algorithm will
eventually sample additional points from Δ^K as time steps progress, whereas Algorithm 4
will not.


A.3.3 Numerical Approach


Our numerical approach includes the following steps, assuming B = 1:

Step 1: Check Condition
First, we verify the J = 1 condition r_1^2(t) ≤ 2B^2. If that condition does not hold, the simplex
is covered by the first ball B(e_k, r_1(t)) and we do not need to explore further.

Step 2: Find Another Ball Center
Otherwise, we use the Augmented Lagrangian method [17] to find a solution to the above
optimization problem. We limit the number of iterations for the Augmented Lagrangian by
some value I_max, and if a feasible solution is not found, we conclude the simplex is covered
by the collection of balls in our possession. If a solution is found, and it is different from all
other ball centers currently in the collection, we add it to our collection.
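
The control flow of these two steps can be sketched as follows. As a deliberate simplification, this sketch replaces the Augmented Lagrangian solver with a plain search over the simplex vertices and randomly drawn simplex points, so it illustrates the structure of the covering oracle rather than the solver described above. The function names and the iteration limit are hypothetical, and the Step 1 check is applied to every ball, which is valid whenever all centers lie in the simplex.

```python
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def is_uncovered(v, centers, radii):
    """True if v lies outside (or on the boundary of) every ball."""
    return all(sq_dist(v, c) >= r * r for c, r in zip(centers, radii))

def find_uncovered_point(centers, radii, dim, budget=1.0, max_iter=2000, rng=None):
    """Return a feasible point with the best objective found, or None."""
    rng = rng or random.Random(0)
    # Step 1: a ball with r^2 > 2 B^2 already covers the whole simplex.
    if any(r * r > 2 * budget ** 2 for r in radii):
        return None
    # Step 2 (stand-in for the solver): try vertices first, then random points.
    candidates = [
        [budget if i == k else 0.0 for i in range(dim)] for k in range(dim)
    ]
    for _ in range(max_iter):
        weights = [rng.expovariate(1.0) for _ in range(dim)]
        total = sum(weights)
        candidates.append([budget * w / total for w in weights])
    best, best_score = None, float("-inf")
    for v in candidates:
        if is_uncovered(v, centers, radii):
            score = sum(sq_dist(v, c) for c in centers)
            if score > best_score:
                best, best_score = v, score
    return best  # None means no uncovered point was found within the limit

first_center = [1.0, 0.0, 0.0]
new_center = find_uncovered_point([first_center], [1.0], dim=3)
fully_covered = find_uncovered_point([first_center], [1.5], dim=3)
print(new_center, fully_covered)
```

A random search can miss small uncovered regions that a constrained solver would find, so returning None here is weaker evidence of coverage than the corresponding outcome of the Augmented Lagrangian method.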